AD-A261  520 


Source-level Debugging of Automatically
Parallelized Programs

Robert  Cohn 

23  October  1992 
CMU-CS-92-204 


School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213-3890


Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy


DISTRIBUTION STATEMENT A
Approved for public release;
distribution unlimited


Copyright © 1992 Robert S. Cohn


Supported in part by an Intel Corporation Graduate Fellowship and in part by the Defense Advanced
Research Projects Agency, Information Science and Technology Office, under the title "Research on
Parallel Computing," ARPA Order No. 7330. Work furnished in connection with the research is provided
under prime contract MDA972-90-C-0033 issued by DARPA/CMO to Carnegie Mellon University and
under its subcontract No. 334918-S8792 with Networks Systems Corporation.

The  views  and  conclusions  contained  in  this  document  are  those  of  the  author  and  should  not  be 
interpreted  as  representing  the  official  policies,  either  expressed  or  implied,  of  Intel,  DARPA  or  the  U.S. 
government.




Keywords: Debugger, compiler, parallelism, distributed memory, optimization


School  of  Computer  Science 




DOCTORAL  THESIS 
in  the  field  of 
Computer  Science 

Source-level Debugging of
Automatically Parallelized Programs

ROBERT S. COHN




Submitted in Partial Fulfillment of the Requirements
for  the  Degree  of  Doctor  of  Philosophy 


ACCEPTED:


DEPARTMENT HEAD


APPROVED: 


DEAN 


DATE 


Source-Level  Debugging  of 
Automatically  Parallelized 
Programs 

Robert  Cohn 


Abstract 


Parallelizing compilers automatically translate a sequential program into a parallel
program. They simplify parallel programming by freeing the user from the need to consider
the details of the parallel architecture and the parallel decomposition. A source-level
debugger for automatically parallelized programs hides the parallelism from the user by
providing the illusion that the original sequential program is executing.

To provide a source-level view, a debugger has two tasks. The first task is to make it
appear that programs execute operations in source order. To exploit parallelism, compilers
relax the ordering constraints implied by the semantics of the source language. As a
result, operations are not executed in source order and, when a user is debugging a
program, it may appear that variables are updated out of order. Sequential languages and
languages with explicit parallelism share this problem. Dynamic order restoration is a
method for making it appear that operations are executed in source order.

The second task is to hide the decomposition of data and computation. A parallelizing
compiler partitions data, duplicates variables, and changes the structure of loops. To
provide a source-level view, the debugger must be able to map variables and statements
in the source program to their corresponding variables and statements in the target
program. We call this structural mapping.

This thesis describes a method for implementing dynamic order restoration and structural
mapping in a debugger for automatically parallelized programs. Examples are
taken from the domain of loop-based parallelism.




SOURCE LEVEL DEBUGGING OF AUTOMATICALLY PARALLELIZED CODE




Acknowledgments 


I would like to thank my advisors Thomas Gross and H. T. Kung. Without Thomas' support
and technical feedback I could not have finished this thesis. I am grateful to Kung
for asking the tough questions that kept my work honest.

I would also like to thank the members of my thesis committee. My discussions with
Bernd Bruegge improved the technical content and presentation of the thesis, and David
Padua gave me a valuable outside perspective on my work.

I would like to thank my friends at CMU for making my time here so enjoyable. Their
friendship and support, especially in the final months of the thesis, were a great help.
Special thanks go to Michael Hemy, who carefully read my thesis.

I  would  like  to  thank  my  brother  Richard,  who  brought  me  to  CMU  originally,  and  also 
my  parents  and  sister. 




Table  of  Contents 


CHAPTER 1   Introduction

    1.1  Debugging issues: an example debugging session
    1.2  The gap between programming models and architectures
    1.3  Overview of the approach
    1.4  Related work
    1.5  This presentation

CHAPTER 2   Background

    2.1  Sequential and parallel programming models
    2.2  Parallelizing compiler
    2.3  Semantics of source level debugging
    2.4  Summary

CHAPTER 3   Approach

    3.1  Problem scope
    3.2  What a debugger for parallelization must do
    3.3  Debugger methodology
    3.4  Summary

CHAPTER 4   The thread splitting transformation

    4.1  The thread splitting transformation
    4.2  The O function for thread splitting
    4.3  Parallel execution while debugging
    4.4  Efficiency
    4.5  Related work
    4.6  Summary

CHAPTER 5   Distribution transformations

    5.1  Parallelizing compilers
    5.2  Domain description
    5.3  Basic loop iteration distribution
    5.4  Communication
    5.5  Data distribution
    5.6  Cyclic iteration distributions
    5.7  Summary

CHAPTER 6   A debugger for a compiler with block and cyclic distributions

    6.1  Matrix multiply
    6.2  Compilation
    6.3  Compiler-debugger interface
    6.4  Debugger
    6.5  Performance of block and cyclic distributions when debugging
    6.6  Summary

CHAPTER 7   Limitations of thread splitting

    7.1  Data dependent and independent assignments
    7.2  Why programs with data independent assignments can be less efficient
    7.3  Generating a β program
    7.4  Estimating the overhead of data independent assignments
    7.5  Summary

CHAPTER 8   Conclusions

    8.1  Debuggability of distributions
    8.2  Building debuggers
    8.3  Contributions
    8.4  Future work



CHAPTER  1 


Introduction 


Programming parallel machines is much more difficult than programming sequential
machines. Writing a parallel program requires that the programmer divide the computation
among a set of processors so that the load is equally balanced, add synchronization
if the target architecture is a shared memory machine, and add communication if the
target is a distributed memory machine. Writing a program that is optimized for a
particular architecture makes these tasks especially difficult.

Parallel program generators, also called parallelizing compilers, can ease the
programmer's burden by automating some or all of this work. Some compilers provide a
programming model that allows the programmer to express the algorithm in a manner that
is independent of the number of processes; the compiler decides how to best distribute
the computation. For distributed memory machines, compilers can provide a shared
memory programming model and automatically distribute data.

Compilers may make writing programs easier but do not solve the whole problem:
coding is just one step in creating a new program. In particular, debugging can be a very
time consuming and tedious process. Debugging an automatically parallelized program
is difficult because the user did not write the target program that actually executes.
Understanding how the program works would require that the user know all the details
of the parallelization that the compiler was trying to hide.

My thesis is that a debugger for automatically generated parallel programs can provide a
sequential view of parallel execution. In other words, parallelizing compilers can have
source-level debuggers [Pineo 91][Cohn 91]. The user can set breakpoints and inspect
variables as if they are debugging their source program, even though a parallelized
program is executing.




The rest of the chapter is organized as follows. In Section 1.1 we illustrate some of the
issues in source-level debugging with an example. A description of the various
programming and machine models employed for parallel machines, and their impact on
debugging, can be found in Section 1.2. Section 1.3 introduces the approach of this
thesis for building debuggers. Section 1.4 discusses some related work, and Section 1.5
outlines the presentation of the thesis.


1.1  Debugging issues: an example debugging session

There is a wide spectrum of programming models and machine architectures. Programming
models and architectures range from sequential to MIMD parallel and from shared
to distributed memory. Each combination of programming model and architecture has
different requirements for source-level debugging. To introduce the main issues in
implementing source-level debugging, we will use a simple example where the
programming model is sequential and the actual machine is a MIMD, distributed memory
machine.


FIGURE 1-1 depicts a sequential program that initializes all the elements of the array A
to 1 and a two processor parallel program generated by a compiler. In the parallel
program, processor 0 executes all the even iterations of the original loop and processor 1
executes all the odd iterations. On a distributed memory machine, the data must be
partitioned. For this program, processor 0 has all the elements of A with even indexes and
processor 1 has all the elements of A with odd indexes.

The main issues that we address in this thesis are setting breakpoints, examining and
modifying variables, and reporting the current location of the program counter in the
program. We use the above program to give examples of the problems associated with
these issues. Assume that the user starts execution of the program and, after some time,
interrupts the program. As depicted in FIGURE 1-2, processor 0 is executing the
loop control of the for statement and processor 1 is executing the body of the loop. In
the figure, the stop signs indicate the source lines that each processor will execute next.
The debugger must report one statement in the source program as being the current
location. To determine this, the debugger must know the relationship between lines in
the source program and lines in the parallel program. In this example, the
correspondence is simple, but it can be complicated when the compiler applies
transformations that insert and delete lines and restructure the program.

Next, the user tries to inspect the value of array element A[2]. The array A has been
divided into two smaller arrays, as shown in FIGURE 1-3. Array element A[2]
resides on processor 0 and is called A0[1]; the debugger must know the relationship
between data in the source program and data in the target program. This relationship is
not necessarily static. For example, each processor has its own copy of the loop counter
i, and choosing which copy is the correct one to display depends on the state of the
program.
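
The static part of this relationship can be sketched in C for the even/odd partition of FIGURE 1-1. This is only an illustration under that one distribution; the type target_loc and the function map_element are hypothetical names introduced here, not part of the debugger described in this thesis.

```c
#include <assert.h>

#define NPROCS 2   /* two processors, as in FIGURE 1-1 */
#define N 30       /* size of the source array A */

/* Hypothetical record: where a source element lives in the target program. */
typedef struct {
    int processor;    /* processor whose memory holds the element */
    int local_index;  /* subscript within that processor's local array */
} target_loc;

/* Map source element A[i] to its target location under the even/odd
   partition: A[i] is stored as A<i % 2>[i / 2]. */
target_loc map_element(int i)
{
    target_loc loc;
    assert(i >= 0 && i < N);
    loc.processor = i % NPROCS;
    loc.local_index = i / NPROCS;
    return loc;
}
```

With this mapping, A[2] resolves to element 1 of processor 0's array, matching the A0[1] of the example. The dynamic part of the relationship, such as choosing between the copies of the loop counter, is not captured by such a table lookup.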

Before actually displaying the value, the debugger must be sure that the value that is
currently in memory is the same one the user would have seen if the source program
were executing. If processor 0 has executed one iteration of its loop, and processor 1 has
executed three iterations, then the memory of each processor would appear as is shown




FIGURE 1-1  Sequential program and its parallel version


Sequential source program

    int i;
    int A[30];

    for (i = 0; i < 30; i++) {
        A[i] = 1;
    }


Processor 0

    int i0;
    int A0[15];

    for (i0 = 0; i0 < 30; i0 += 2) {
        A0[i0/2] = 1;
    }


Processor 1

    int i1;
    int A1[15];

    for (i1 = 1; i1 < 30; i1 += 2) {
        A1[i1/2] = 1;
    }


FIGURE 1-2  When the user aborts the program, processor 0 is stopped at the for statement and
processor 1 is executing the assignment


Processor 0

         int i0;
         int A0[15];
    STOP for (i0 = 0; i0 < 30; i0 += 2) {
             A0[i0/2] = 1;
         }


Processor 1

         int i1;
         int A1[15];
         for (i1 = 1; i1 < 30; i1 += 2) {
    STOP     A1[i1/2] = 1;
         }


in FIGURE 1-4. If we merged both memories to produce the memory of the source
program, then it would contain the values as shown in the right side of the figure. This state
clearly cannot be the result of execution of the sequential program. The loop in the
source program walks through the array A sequentially, but array element A[3] has
been initialized while element A[2] has not. The debugger must detect when such an
inconsistency has occurred so that it does not give the user any misleading information.




FIGURE 1-3  The relationship of names of variables in the source and target programs

In this case, telling the user that A[2] has a value of 0 and A[3] has a value of 1 is
misleading because the programmer might incorrectly conclude that the program failed
to initialize element A[2]. In this situation, one possible response for the debugger
could be that A[2] has a value of 0 and that A[3] and A[5] cannot be examined.
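
The response described above can be sketched in C for this one loop. The sketch assumes that the debugger knows how many iterations each processor has completed and that source iteration k writes only A[k]; the names first_pending_iteration and element_status are hypothetical, and this is not the mechanism developed later in the thesis.

```c
#include <assert.h>

#define NPROCS 2
#define N 30

typedef enum { VALUE_CURRENT, VALUE_NOT_EXAMINABLE } view_status;

/* done[p] is the number of loop iterations processor p has completed.
   Processor p executes source iterations p, p + 2, p + 4, ... */

/* First source iteration that has not yet been executed. */
int first_pending_iteration(const int done[NPROCS])
{
    int k;
    for (k = 0; k < N; k++) {
        if (k / NPROCS >= done[k % NPROCS])
            return k;
    }
    return N;
}

/* Decide whether A[k] may be shown as the sequential program would show it. */
view_status element_status(const int done[NPROCS], int k)
{
    int pending, executed;
    assert(k >= 0 && k < N);
    pending = first_pending_iteration(done);
    executed = (k / NPROCS) < done[k % NPROCS];
    /* Iteration k has run although an earlier iteration is still pending:
       the value in memory is ahead of the sequential execution. */
    if (executed && k > pending)
        return VALUE_NOT_EXAMINABLE;
    return VALUE_CURRENT;
}
```

For the state in the example (one iteration on processor 0, three on processor 1), the first pending iteration is 2, so A[2] is reported with its current value while A[3] and A[5] are withheld.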

FIGURE 1-4  The state of memory after processor 0 has executed 1 iteration and processor 1 has
executed 3 iterations

Setting  breakpoints  can  have  similar  problems.  In  our  example,  iteration  3  of  the  source 
program  has  already  executed,  but  iteration  2  has  not  yet  executed.  If  the  user  sets  a 



breakpoint in the loop, the program reaches the breakpoint for iterations 2 and 4 but has
already run past the breakpoint for iteration 3. In this situation, the debugger should
warn the user that some breakpoints might be skipped. When breakpoints occur, the
debugger should present them to the user in the order that they would have occurred in
the sequential program, even though they may occur out of source order.

We classify the problems described above into two categories, dynamic order restoration
and structural mapping. These are the fundamental services that a debugger for
parallelized programs must provide. This thesis describes how to implement these services.

Dynamic order restoration ensures that the observable order of execution of operations
is the same as in the source program. If the semantics of the source language imply that
operation A is executed before operation B, and in the target program operation A can
be executed before or after operation B, then a debugger that does dynamic order
restoration behaves as if operation A is always executed before B.

Parallelizing compilers relax the ordering constraints implied by the semantics of the
source language to increase parallelism. The semantics of a sequential program imply
that only one operation at a time is executed; a compiler converts a sequential program
into a parallel program where more than one operation can be executed simultaneously.
Compilers remove ordering constraints even when the source language has explicit
parallelism. We explain this in more detail in Section 1.2.

If operations are executed out of source order, then as described in the example of
FIGURE 1-4, variables may be updated out of order. A debugger that does dynamic order
restoration must make it appear that they are updated in order. This may include
preventing the user from examining variables at some points in the program. Another
possible result of out of order execution is that breakpoints may not occur in source order.
The debugger must present the breakpoints to the user in source order.
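
One simple way to picture source-order presentation of breakpoints is a buffer that holds breakpoint hits until every earlier hit has been shown. The C sketch below assumes a single breakpoint inside the loop of FIGURE 1-1 that was set before the loop started, so each source iteration hits it exactly once; the bp_queue type and its functions are hypothetical names, and the scheme is an illustration rather than the method this thesis develops.

```c
#include <assert.h>
#include <stdbool.h>

#define N 30   /* number of source iterations */

typedef struct {
    bool hit[N];  /* hit[k]: the target program reached the breakpoint in iteration k */
    int next;     /* next source iteration whose breakpoint the user should see */
} bp_queue;

void bp_init(bp_queue *q)
{
    int k;
    for (k = 0; k < N; k++)
        q->hit[k] = false;
    q->next = 0;
}

/* Called when some processor reaches the breakpoint in the given iteration. */
void bp_record(bp_queue *q, int iteration)
{
    assert(iteration >= 0 && iteration < N);
    q->hit[iteration] = true;
}

/* Returns the iteration whose breakpoint should be presented next, or -1 if
   the breakpoint that comes next in source order has not been reached yet. */
int bp_present(bp_queue *q)
{
    if (q->next < N && q->hit[q->next])
        return q->next++;
    return -1;
}
```

If iteration 1 reaches the breakpoint before iteration 0, bp_present returns -1 until iteration 0 is recorded, and then yields 0 followed by 1.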

Structural mapping relates the source lines and variable names in the source and target
programs. In the previous example, determining that array element A[3] resides in the
memory of processor 1 and that it is called A1[1] is structural mapping.

Structural mapping is also necessary in debuggers for sequential machines; relating
source code and machine code is an example. In addition, programs can be restructured
by optimizing compilers. Dynamic order restoration is unique to debuggers for
parallelizing compilers and requires new mechanisms; the debugger must be able to determine
the order in which operations executed. The nature of dynamically reordered operations
also provides for flexibility that is not present for sequential programs. If a constraint is
removed and two operations can occur in order or out of order, then the debugger can
constrain execution so that they occur in order.


1.2  The  semantic  gap  between  programming  models 
and  architectures 


The difficulty of source-level debugging depends upon the gap between the programming
model and the code that executes on the parallel machine. This is in turn dependent
on the semantics of the source language, the architecture of the parallel machine,
and the way the compiler implements the source language on the machine. All compilers
create the need for structural mapping. Some, but not all, need dynamic order
restoration. In this section, we sketch out the spectrum of programming models provided by
parallelizing compilers and discuss what problems are introduced when they are
implemented on different architectures.

The programming model that the source language allows can be very different from the
architecture. From a debugging standpoint, the two most important components of the
model are the user view of processors and memory.

In a programming model with a low level of abstraction, the user must explicitly write a
program for each processor in the system. Compilers can lift the level of abstraction by
adding constructs that directly support parallelism. The cobegin/coend and doall
are two examples. The semantics of these constructs is that there is only one thread at
the beginning of the construct, and one thread is spawned to execute each clause in the
cobegin/coend or each iteration of the doall. After all the threads have completed
their assigned work, there is a join and only one thread continues execution. Data
parallel operations are similar: there is one thread of control before the operation begins, the
data parallel operation is executed in parallel, and there is one thread after the operation
completes. These constructs hide the architecture of the parallel machine and the details
of the mapping. The user does not need to know how many processors there are or
whether the architecture is SIMD or MIMD, and does not need to decide which
operations are executed on each processor. The programming model can be moved one step
further from the architecture by providing a sequential model of computation. In this
case, the user does not specify the parallelism; the compiler automatically detects it.

The user view of memory in programming models ranges from distributed memory,
where a processor may only access data in its local memory, to shared memory, where a
processor may access all data uniformly. For programming models with distributed
memory, data can only be moved between processors when the originating processor
sends the data and the destination processor receives it. Other systems provide
programming models that are a hybrid; data can be fetched from another processor's memory
without the owner sending it, or data can be put in another processor's memory without
the destination processor receiving it.

A  semantic  gap  exists  when  the  programming  model  doesn’t  match  the  architecture. 

The rest of this section explains which combinations of programming models and
architectures require structural mapping and dynamic order restoration. If the compiler
provides a shared memory model on a distributed memory machine, then the debugger
must also provide the illusion that there is a single address space by translating the
variable names and array subscripts in the source program to names and subscripts in the
parallel program.

There is a close match between a SIMD architecture and languages with data parallel
operations; the basic operations of the machine are element-wise. Executing a sequential
language or a language with parallel loops on a SIMD machine requires using the
vectorization techniques of strip mining, loop distribution, etc. These techniques
introduce some structural changes, but since the target architecture has a single thread of
control, no dynamic order restoration is required.




A MIMD architecture creates the need for order restoration. If a programming model
with explicit processes is implemented on a MIMD machine, then there is little
difference between the programming model and architecture. However, if a restricted parallel
model, such as parallel loops [Poly 89][Balasundaram 89][Mehrotra 90][Sussman 91]
or a data parallel model [Chatterjee 91][FortranD 92][Knobe 90], is implemented on a
MIMD machine, then there is a potential for dynamic reordering to occur, depending on
the implementation. A straightforward but inefficient way to implement a parallel loop
is to spawn off processes at the beginning of the loop, have each process execute some of
the iterations, and then kill all but a master process after the loop terminates. If a parallel
loop is implemented in this manner, the semantics of the source language and the actual
execution are a close match. However, most compilers use a more efficient method for
implementing a parallel loop. All the processes execute the sequential code outside the
parallel loop, each process executes some of the iterations of the loop, and all processes
continue execution after the loop terminates [Booth 86][Cytron 90][Chatterjee
91][Tseng 89]. The processes do not necessarily synchronize at the beginning or end of
the loop, unless data dependences require it. This implementation is more efficient
because the program does not need to spawn and kill processes, and the lack of barrier
synchronizations can reduce idle time. With such an implementation, it is possible for
one processor to be executing code before the beginning of the loop while another
processor is executing iterations of the loop. If the program were to stop in that state, the
order of execution would violate the semantics of the language, which requires dynamic
order restoration.
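
The more efficient implementation can be pictured with the following C sketch of the code that every process runs. It assumes a cyclic assignment of iterations to processes, as in FIGURE 1-1, and uses one array visible to all callers purely to keep the illustration short; the function name spmd_loop_share is hypothetical.

```c
#include <assert.h>

#define N 30   /* number of iterations of the parallel loop */

/* Code executed by process pid out of nprocs processes.
   Returns the number of loop iterations this process executed. */
int spmd_loop_share(int pid, int nprocs, int a[N])
{
    int i, count = 0;
    assert(nprocs > 0 && pid >= 0 && pid < nprocs);

    /* Sequential code before the loop would be executed here by every
       process; no process is spawned at the loop. */

    /* Each process executes only its own share of the iterations. */
    for (i = pid; i < N; i += nprocs) {
        a[i] = 1;
        count++;
    }

    /* No barrier: the process simply continues with the code after the
       loop, possibly while other processes are still inside it. */
    return count;
}
```

Because nothing forces the processes to enter or leave the loop together, one process can still be in the code before the loop while another has already executed several iterations, which is the situation that calls for dynamic order restoration.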

Executing data parallel operations on a MIMD machine is similar to executing parallel
loops; if a barrier synchronization is done before and after every data parallel operation,
then the semantics of the language and the actual execution are a close match. If the
synchronization is not needed, then it can be removed to make the program more efficient.
When the synchronization is removed, two processors can be executing different
operations at the same time [Chatterjee 91], which violates the semantics and creates a need
for dynamic order restoration in the debugger.

To summarize, structural mapping is always needed. The debugger must associate
statements in the source and target programs. If the compiler provides a shared memory
model on a distributed memory machine, then the debugger must also provide that
model. When the target program does not obey the ordering constraints of the source
program, as is the case for data parallel, parallel loop, and sequential models
implemented on MIMD machines, then the debugger must also provide dynamic order
restoration.

The biggest gap between source program and execution occurs when a single address
space, sequential program is executed on a distributed memory MIMD machine. This is
the main focus of our thesis. However, the methods explained in the thesis apply even
when the programming model is not sequential with a single address space.


1.3  Overview of the approach

The principal insight of our approach is that the structural aspects of parallelization can
be expressed in the context of a program which has not been parallelized. For example,
in FIGURE 1-5 we have rewritten the sequential program in FIGURE 1-1 as another



sequential program where the distribution of A into two smaller arrays and the
duplication of the loop counter i are exposed but the parallelism is not visible.


FIGURE 1-5  A restructured version of the sequential program in FIGURE 1-1.

    int i0, i1;
    int A0[15], A1[15];

    for (i0 = 0, i1 = 1; i0 < 30; i0 += 2, i1 += 2) {
        A0[i0/2] = 1;
        A1[i1/2] = 1;
    }


Our methodology for constructing debuggers separates the dynamic ordering and
structural mapping issues by breaking the parallelization transformation of compilation into
two phases. Instead of translating the program directly from sequential to parallel, as is
depicted in the top of FIGURE 1-6, the compiler uses the two phase process shown in the
bottom of FIGURE 1-6. The first phase is called distribution and exposes the
restructuring necessary for parallelization. It is a sequential to sequential program transformation,
and its output is the β program. The second phase is a parallelization transformation
called thread splitting and its output is the parallel ω program. Thread splitting is a
simple and general transformation which extracts multiple threads of control from a single
one. A debugger is constructed by building a debugger for distribution and another
debugger for thread splitting. The debugger for the original transformation can then be
built by composing the two debuggers. The source level for the debugger for the
distribution transformation is the α program and the source level for the thread splitting
debugger is the β program.

Structural mapping is only done in the debugger for distribution. Dynamic order
restoration is only done in the debugger for thread splitting. To implement a different
parallelizing transformation, we change the distribution, but thread splitting is the same. Since
thread splitting is the only transformation that introduces parallelization in the system,
its debugger can be reused for all parallelizing compilers. In effect, the part of the
debugger that manages parallelism is isolated and is independent of the parallelizing
transformations that a compiler uses.
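
The relationship between the two phases can be illustrated in C with the running example. The sketch assumes that thread splitting, applied to the restructured loop of FIGURE 1-5, yields the two per-processor loops of FIGURE 1-1; the function names are hypothetical, and the threads are simply called one after the other rather than run in parallel.

```c
#include <assert.h>
#include <string.h>

#define HALF 15

/* Output of distribution: the restructured sequential loop of FIGURE 1-5. */
void distributed_program(int A0[HALF], int A1[HALF])
{
    int i0, i1;
    assert(A0 != NULL && A1 != NULL);
    for (i0 = 0, i1 = 1; i0 < 30; i0 += 2, i1 += 2) {
        A0[i0 / 2] = 1;
        A1[i1 / 2] = 1;
    }
}

/* Assumed output of thread splitting: one thread per processor of FIGURE 1-1. */
void thread0(int A0[HALF])
{
    int i0;
    for (i0 = 0; i0 < 30; i0 += 2)
        A0[i0 / 2] = 1;
}

void thread1(int A1[HALF])
{
    int i1;
    for (i1 = 1; i1 < 30; i1 += 2)
        A1[i1 / 2] = 1;
}

/* Returns 1 if the split threads leave memory in the same final state as
   the distributed sequential program, whatever order they run in. */
int split_matches_distributed(void)
{
    int b0[HALF] = {0}, b1[HALF] = {0};
    int t0[HALF] = {0}, t1[HALF] = {0};
    distributed_program(b0, b1);
    thread1(t1);   /* deliberately run "out of order" */
    thread0(t0);
    return memcmp(b0, t0, sizeof b0) == 0 && memcmp(b1, t1, sizeof b1) == 0;
}
```

The variable names A0, A1, i0, and i1 are already fixed by distribution, so the splitting step introduces no new structural changes. What it does introduce is freedom in the order of the assignments, which is the concern of the thread splitting debugger.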


1.4  Related work

The two main features that are present in parallel program debugging but are not in
sequential program debugging are nondeterminism and the extra information related to
multiple threads of control. Nondeterminism makes debugging difficult because
incorrect behavior might not be reproducible. Multiple threads of control make debugging
more difficult because the user must filter through extra information when debugging
(e.g. breakpoints can occur on more than one processor at the same time).




FIGURE 1-6  Compiler transformation sequential to parallel


To solve the first problem, researchers have worked on systems to detect
nondeterminism when it is an undesirable property of a program and to control it when it is
necessary. Detecting nondeterminism usually relies on a combination of static analysis of
programs and run-time checking [Netzer 91][Callahan 90][Emrath 89][Wang
90][Dinning 91]. These tools are intended for shared memory machines and check for
unsynchronized accesses to shared variables. In our work, we assume that the program
has been automatically parallelized, so nondeterminism of the type detected by these
debugging tools cannot occur if the compiler is correct. It is possible, however, that the
user can mistakenly direct the compiler to parallelize a loop that cannot be correctly
executed in parallel, which could create a nondeterministic program. Nothing prevents
the user from employing other tools in conjunction with a source-level debugger in this
case. Once all synchronization bugs have been removed, source-level debugging can be
used.

If a program is intentionally nondeterministic, then debugging is difficult because the
problem may not be reproducible when using the debugger. To solve this problem,
researchers are studying how to make the behavior reproducible, either by restricting
execution so that it is reproducible, debugging from traces of a single run, or a combination
of both [Forin 88][Miller 88][LeBlanc 87][Tolmach 91][Bacon 91]. With a
MIMD execution model, many different interleavings of instruction executions
are possible even for a program with deterministic results. This can be viewed as a type
of nondeterminism since the interleaving varies from execution to execution. By providing
a source view of execution, we are in effect choosing a single interleaving as the
correct one. In this respect our work is similar to Tolmach [Tolmach 91], who makes a
program deterministic by imposing a total ordering on access to shared objects, but
optimistically allows the program to run in parallel. Roll back, or restoring the program
to a previous state, is used when the optimistic parallel execution violates the chosen
ordering. Our work differs because we give the appearance of a total ordering on access
to all objects in the program but do not necessarily restrict execution to the total ordering.
If Tolmach's method were used to impose a total ordering for all objects, there
could not be any parallelism during execution.

The second feature that distinguishes parallel program debugging from sequential
debugging is the extra information related to having multiple threads of control. For
example, a sequential machine can only have one breakpoint at a time, but on a parallel
machine, every processor can hit a different breakpoint at the same time. Parallel
machines can also generate more information in the same time than sequential
machines. To solve this problem, researchers have studied ways to filter and present the
information that is shown to the user. Visualization and animation have been used
to graphically present information about the status of processors [Heath 91][Bmley
88][Zernik 91][Francioni 91]. Another way to reduce the amount of information that a
user must see is to test for high level events associated with the behavior of the program
rather than low level events such as source line breakpoints. Researchers have studied
how to detect and report events promptly without disturbing the state of the system
[Bruegge 91][Aral 88][Bates 83].

By providing a sequential view, we also filter out unnecessary information. Communication
and synchronization are compiler generated, are not relevant to debugging,
and are not visible to the user. Instead of multiple program counters, there is a single
program counter. We replace the parallel program with an abstraction, the source program.

The work that is the most closely related to ours is the debugging of optimized code
[Hennessy 82][Copperman 90][Coutant 88][Zellweger 84] and the debugging of parallelized
code by Gupta [Gupta 88] and Pineo [Pineo 91]. Debugging optimized code
is similar because it requires structural mapping; however, dynamic order restoration is
not present in debuggers for optimized code because there is no parallelism. The differences
between our work and previous work on debugging parallelized code are explained
in detail in CHAPTER 4.

An alternative to the approach we pursue for debugging parallelized programs is to
compile the sequential source program that the user writes for a single processor, and
use conventional single processor debugging tools [Cft 90][Tseng 89]. This approach
does not work when the execution environments on the sequential and parallel machines
are not identical, when the program uses more memory than can be accessed by a single
processor, or when the program runs too slowly on a single processor.

Another approach is a compromise between providing source-level debugging and making
the user debug the parallel program. A debugger can do structural mapping, so that
the user can use line numbers and variable names from the source program, but still
expose the parallelism [Cft 90].


1.5  This presentation

In CHAPTER 2, we give some background information on debuggers and parallelizing
compilers. We define a notation for specifying debuggers and define correctness for a
source-level debugger. CHAPTER 3 defines the scope of the problem we are solving
and outlines the overall structure of a compiler and debugger based on thread splitting.
CHAPTER 4 describes the thread splitting transformation and its debugger. CHAPTER
5 describes distribution for loop based parallelism and its debugger. CHAPTER 6 integrates
the information in the previous two chapters to present a complete debugger for a
compiler that does block and cyclic distributions of data and computation. CHAPTER 7
identifies the limitations of the thread splitting approach. CHAPTER 8 summarizes the
thesis and identifies contributions and future work.




CHAPTER  2 


Background 


This chapter provides some background for our work. In Section 2.1, we describe the
execution models for sequential and parallel programs. Section 2.2 describes the structure
of a parallelizing compiler. Section 2.3 introduces the notation that we use to specify
debuggers and defines correct behavior for source-level debugging. The notation is
used throughout the rest of the thesis.


2.1  Sequential  and  parallel  programming  models 


The single processor language used in examples in this thesis is the C language; the
results apply to any conventional imperative language such as FORTRAN or Pascal.
Programs with pointers, procedure calls, and recursion are allowed. Program flow graphs
must be structured. That is, there cannot be a jump into the middle of a loop. Baker
[Baker 77] describes a method for generating structured programs from a reducible flow
graph, and any flow graph can be made reducible by duplicating code [Aho 86].

The model of execution is a distributed memory MIMD computer. Processor names,
also called processor indexes, are n dimensional vectors. The dimensionality n is chosen
to match the topology of the interconnection network. Parallel programs are
expressed as a collection of sequential programs, one program per processor. One process
is created on each processor at the start of the program; no processes can be created
after that. The programmer assumes that the processes execute asynchronously on separate
processors; no assumptions can be made about the relative progress of execution of
processes except for the explicit synchronization.

The topology and connectivity of the processor array are unimportant for debugging.
Our examples use linear processor arrays and 2 dimensional tori with nearest neighbor
connections. The sequential language is augmented with primitives for communication.
Communication must be via send and receive statements that have blocking semantics;
that is, a send to a processor with a full buffer or a receive from a processor that has
not sent the data yet causes the processor to stall until the communication action can be
completed. The amount of buffering for communication does not affect debugging.

In our examples, we use send and receive statements that move individual data
items. A send statement has two arguments, a distance vector and a value to be sent.
The receive statement also has two arguments, a distance vector and a place to store the
value received. A distance vector is an offset which can be added to the index of the processor
executing the communication action to obtain the index of the other processor in
the communication action. On a 2D array with a NEWS grid (North, East, West, and
South), the processor to the west of processor (i, j) is processor (i, j-1), so an offset of
(0,-1) is used to send to the west and an offset of (0,1) is used to receive from the east. If
we want to send to the east, then the sender uses an offset of (0,1) and the receiver uses
an offset of (0,-1). The processor to the north of processor (i, j) is processor (i+1, j). If
we want to send data to the north, the sender uses an offset of (1,0) and the receiver uses an
offset of (-1,0). If a processor array supports communication to non-neighboring processors,
then a distance vector can have a magnitude greater than one.

A mod function is applied to the result of adding the processor index and distance to
"wrap around" processor indexes. For example, if a 1-dimensional array has 5 processors,
and processor 0 executes a send with a distance of (-1), then the data is sent to processor
4.

For convenience in writing programs, we also use sende and receivee, which are
variants of send and receive. These constructs have the same semantics as send
and receive except that they do not use the wrap-around connections of the processor
array. If the processor index plus the distance would go outside the bounds of the processor
array, then the communication action is not executed. Reusing our previous
example, if processor 0 executes a sende with a distance of (-1), the sende is not executed.
The sende and receivee are useful when we want the processors on the ends
of the array to behave differently from the rest of the processors.
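
The addressing rule for distance vectors can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `partner` and `partner_no_wrap` and the `shape` parameter are hypothetical and do not come from the thesis. The first function models the wrap-around rule used by send and receive; the second models the variants that skip a communication action that would leave the array.

```python
def partner(index, distance, shape):
    """Index of the other processor for a wrap-around send/receive.

    index, distance and shape are tuples with one entry per dimension
    of the processor array; shape holds the number of processors along
    each dimension (an assumed representation).
    """
    return tuple((i + d) % n for i, d, n in zip(index, distance, shape))


def partner_no_wrap(index, distance, shape):
    """Partner for the non-wrapping variants, or None when the target
    lies outside the array and the communication action is skipped."""
    target = tuple(i + d for i, d in zip(index, distance))
    if all(0 <= t < n for t, n in zip(target, shape)):
        return target
    return None


# 1-dimensional array of 5 processors, as in the example in the text.
print(partner((0,), (-1,), (5,)))          # (4,)
print(partner_no_wrap((0,), (-1,), (5,)))  # None
# 2D array: an offset of (0, -1) addresses the processor to the west.
print(partner((2, 3), (0, -1), (4, 4)))    # (2, 2)
```

Returning `None` for a skipped action is a design choice of this sketch; it lets the caller decide whether to do nothing or to treat the edge processor specially.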


2.2  Parallelizing compiler

The parallelizing compiler, or parallel program generator, takes a program written in
some high level language and outputs a parallel program that can be executed on a parallel
machine.

A typical parallelizing compiler has three phases: restructuring, parallelization, and single
processor compilation. The first phase is restructuring, which consists of source to
source transformations that change the structure of program constructs like loops and
conditionals. Both the input and output are sequential programs. Loop interchange and
strip mining [Padua 86] are some examples of restructuring transformations. Restructuring
is used to increase the locality of reference, increase the potential for parallelism,
and reduce the communication.

The next phase is parallelization, where the operations and data of the program are
mapped to processors to parallelize execution. The input is a single program, with either
sequential semantics or possibly parallel constructs like doall and cobegin, and the
output is a parallel target program that implements a mapping of work to processors.
The output can be either a single program which can be used for all processors or multiple
programs, where each processor gets its own program. Most compilers emit a single
program.

If the input to parallelization has a doall or a sequential loop construct, the output is a
program where each processor executes its assigned set of iterations of the loop. As an
example, the sequential program in FIGURE 1-1 is the input to parallelization and the
parallel program below it is the output.
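
The idea that each processor executes only its assigned iterations can be sketched as follows. The block and cyclic assignment formulas, and the names `block_iterations`, `cyclic_iterations` and `run_parallelized`, are assumptions made for this illustration; they are not taken from the figure referenced above. The processors are emulated one after another, which is enough to show which iterations each one owns.

```python
def block_iterations(p, nprocs, n):
    """Iterations of a loop 0..n-1 owned by processor p under a block
    distribution: contiguous chunks of ceil(n / nprocs) iterations."""
    chunk = -(-n // nprocs)  # ceiling division
    return list(range(p * chunk, min((p + 1) * chunk, n)))


def cyclic_iterations(p, nprocs, n):
    """Iterations owned by processor p under a cyclic distribution."""
    return list(range(p, n, nprocs))


def run_parallelized(body, nprocs, n, owned):
    """Emulate a single parallel program: every processor runs the same
    code but executes only its own iterations, in their original order."""
    for p in range(nprocs):            # stand-in for nprocs processes
        for i in owned(p, nprocs, n):
            body(p, i)


a = [0] * 10

def body(p, i):
    a[i] = i * i                       # loop body of the sequential source

run_parallelized(body, 4, 10, block_iterations)
print(block_iterations(3, 4, 10))      # [9]
print(cyclic_iterations(1, 4, 10))     # [1, 5, 9]
```

Within one processor the iterations keep their source order, which matches the property stated in the text that parallelization does not reorder operations inside a single thread.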

There are two features of the parallelization phase that are important to the rest of this
work. First, the source and target of parallelization perform the same computations; no
changes to the algorithm are made. Second, the parallelization phase preserves the order
of execution of operations in a single thread. In parallelization, we map operations to
processors, but we do not change the computation or the order of execution of operations.

The final phase is single processor compilation. The tool chain for single processor
compilation consists of a compiler, assembler, and linker. The result of single processor
compilation is an executable which can be loaded into the memory of a processor.

The result of each of the phases of compilation is a version of a program with the same
meaning as the source, but possibly in a different representation. The source version is a
program written in a high-level language with parallel constructs. After parallelization,
the result is a program that is still in a high-level language, but with the distribution
explicit. After single processor compilation, the result is a machine language version.
All of these versions of the program have the same meaning, but only one is directly
executable by the processor itself.

2.3  Semantics  of  source  level  debugging 


In this section, we introduce a notation to describe debuggers, and use it to define the
correct behavior of a debugger for sequential programs. This notation is used to define
debuggers in the rest of the thesis. A debugger is a tool that can be used to execute a
program, inspect and modify its state, and set breakpoints. A debugger is called source-level
when it executes the machine language version of a program, but makes it appear
that the source version of the program is executing.

Section 2.3.1 describes the basic functionality of a debugger and the notation used to
define debuggers. In Section 2.3.2, we define correct behavior for a source level debugger.
We don't need to introduce parallelism to define the behavior of a source-level
debugger, so to simplify the discussion, the definitions in Section 2.3.1 and Section 2.3.2
only apply to a debugger for a sequential program running on a sequential machine. In
Section 2.3.3, we extend the notation to parallel programs and machines. Section 2.3.4
describes how to compose debuggers.




2.3.1  Debugger notation

The basic functionality of a debugger includes the ability to set breakpoints, examine
variables, modify variables, and report the current location. The commands for these
are set, examine, where, and run. The set command changes the value
of a variable. It takes two arguments, the name of the variable and the value to change
it to. The examine command inspects the values of variables. Its argument is the name
of the variable and the debugger displays its value. The run command executes the program
until a breakpoint or exception is reached or until the program terminates. The
run command takes two arguments, the program to be executed and a list of breakpoints.
A program may be single stepped by running with a breakpoint list of the special
token ALL. The where command determines the current location in the source program,
which is the next statement to be executed. The command doesn't take any arguments,
and the debugger displays a unique label that identifies the statement. For a high-level
program, the label can be thought of as a line number in the program. All of the
commands implicitly take the program state as an argument.

A program state consists of the label of the statement about to be executed and the values
of user variables. The part of the state that contains the values of variables can be
thought of as a binding between names and values. If the variable is a scalar variable,
the name is just an identifier. If the variable is an array element, then the name is an
identifier and array subscripts. Examining a variable looks up the value bound to the
name, while modifying a variable replaces a binding of a name and value with a new
one.

Rather than specify the entire debugger, which includes a user interface and other extraneous
details, we define a kernel that processes our small set of commands. We call this
the debugger function or D function. The functionality of a D function is:


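$$D : \mathit{Dstack} \times \mathit{Command} \times \mathit{State} \times \mathit{Program} \times \mathit{Bpts} \times \mathit{VarName} \times \mathit{VarValue} \rightarrow \mathit{State} \mid \mathit{VarValue} \mid \mathit{Location}$$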

The Command argument is any one of the commands described previously. The rest of
the arguments for the D function are arguments for the particular command. Some commands
do not need all the arguments for the D function (e.g. run takes a program state,
program, and breakpoint list but does not need a variable name or value). In that case, a
dummy value of $\perp$ is used for the unnecessary arguments (e.g.
$D(\perp, \mathit{run}, s, p, (3), \perp, \perp)$). The Dstack makes it possible to compose debuggers; we
delay an explanation of this until Section 2.3.4. The type of the value returned by a D
function depends on the command. Both the set and run commands create new program
states. The examine command returns a VarValue, the value of a variable. The
where command returns the location in a program.

In FIGURE 2-1, we define the behavior of a debugger with a base debugger where the
source and target programs are the same (no compilation is necessary). The base debugger
uses an interpreter to directly execute the program.




To model the behavior of the program, it is assumed that there is an interpreter function,
$I_L$, where $I_L(p, s, b)$ returns a new state that is the result of executing a program p in
language L on a state s until a breakpoint in the list b is reached or until the program
terminates. In the new state, the current label and the values of any variables assigned to
by the statements executed are changed. If the language L is the machine code, then $I_L$
models the behavior of the processor and the resulting state reflects executing machine
instructions. If L is a high level language, then $I_L$ is an interpreter and the resulting
state reflects executing statements of the source language. Most of the details of the
semantics of the language are orthogonal to the issues that we wish to address about
debugging. For this reason, all that information is hidden inside the program state and
the I function.

In the base debugger, the label_of function extracts the current label from a program
state. The access function is used to look up the value of a variable in a state. The arguments
are a program state and the name of a variable. The update function is used to
modify the value of a variable in a state. It takes a program state, a variable name, and a
variable value and returns a new state where the variable is bound to the new value.


FIGURE 2-1  Definition of the base debugger

BaseDebugger(Dstack, Command, State, Program, Bpts, VarName, VarValue)
{
    if (Command == "where")
        return label_of(State)
    if (Command == "examine")
        return access(State, VarName)
    if (Command == "set")
        return update(State, VarName, VarValue)
    if (Command == "run")
        return I_L(Program, State, Bpts)
}
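
A runnable sketch of the base debugger may make the figure more concrete. Everything about the representation here is an assumption made for illustration: a program is a list of (variable, expression) assignments, a label is an index into that list, a state is a dictionary, `BOTTOM` stands in for the dummy argument, and the toy `interpret` function plays the role of the interpreter function. The toy interpreter always executes at least one statement and then stops before the next statement whose label is a breakpoint.

```python
BOTTOM = None   # stands in for the dummy argument
ALL = "ALL"     # special breakpoint list used for single stepping


def label_of(state):
    return state["label"]


def access(state, name):
    return state["vars"].get(name)


def update(state, name, value):
    # Return a new state; the old state is left untouched.
    new_vars = dict(state["vars"])
    new_vars[name] = value
    return {"label": state["label"], "vars": new_vars}


def interpret(program, state, bpts):
    """Toy interpreter function (assumed program and label encoding)."""
    label, env = state["label"], dict(state["vars"])
    while label < len(program):
        name, expr = program[label]
        env[name] = expr(env)
        label += 1
        if bpts == ALL or label in bpts:
            break
    return {"label": label, "vars": env}


def base_debugger(dstack, command, state, program, bpts, name, value):
    if command == "where":
        return label_of(state)
    if command == "examine":
        return access(state, name)
    if command == "set":
        return update(state, name, value)
    if command == "run":
        return interpret(program, state, bpts)
    raise ValueError("unknown command: %r" % (command,))


program = [
    ("x", lambda v: 2),
    ("y", lambda v: v["x"] + 1),
    ("z", lambda v: v["x"] * v["y"]),
]
s0 = {"label": 0, "vars": {}}
s1 = base_debugger(BOTTOM, "run", s0, program, [2], BOTTOM, BOTTOM)
print(base_debugger(BOTTOM, "where", s1, BOTTOM, BOTTOM, BOTTOM, BOTTOM))   # 2
print(base_debugger(BOTTOM, "examine", s1, BOTTOM, BOTTOM, "y", BOTTOM))    # 3
```

Running again from `s1` with the breakpoint list `ALL` executes exactly one more statement, which is how single stepping is expressed in this notation.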


2.3.2  Correctness of source-level debuggers

For correctness, we require that the behavior observable by the user be the same
whether the base debugger of FIGURE 2-1 is used with a program p written in language
L, or a source-level debugger is used with a compiled version of the program p.

When defining correctness, we can't simply require that all the input/output behavior of
the base debugger function and the source-level debugger function be the same. While
the meanings of the two programs are the same, they are different versions and hence the
program states are different.

To decide if two debuggers provide equivalent behavior, we require that if we start with
program states that are indistinguishable when using debugger functions to examine
them, then executing a command that modifies the program state (run and set) also
leads to indistinguishable program states. We
first introduce definitions for consistency and congruency to define correctness. Consistency
is used to compare program states.


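Definition 2-1  Two states $s_1$ and $s_2$ are consistent for debuggers $D_1$ and $D_2$
(consistent($s_1$, $s_2$, $D_1$, $D_2$)) if:

$$D_1(\perp, \mathit{where}, s_1, \perp, \perp, \perp, \perp) = D_2(\perp, \mathit{where}, s_2, \perp, \perp, \perp, \perp)$$

and

$$\forall n \; \bigl( D_1(\perp, \mathit{examine}, s_1, \perp, \perp, n, \perp) = D_2(\perp, \mathit{examine}, s_2, \perp, \perp, n, \perp) \bigr).$$

Usually the debuggers are obvious from the context. In that case we omit them and just
say states $s_1$ and $s_2$ are consistent, or consistent($s_1$, $s_2$).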


If two states are consistent for a pair of debuggers, then we cannot distinguish the two
states by using the debuggers to examine them. We use the $\forall$ because the condition on
examining variables must be true for all possible variable names.
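
The consistency test can be sketched in Python as a check over a finite collection of names; restricting the quantifier to a finite list is an assumption of this sketch, as are the two toy debuggers `d_source` and `d_target` and the encoding of a state as a (label, bindings) pair. `BOTTOM` stands in for the dummy argument.

```python
BOTTOM = None  # dummy argument


def consistent(s1, s2, d1, d2, names):
    """Consistency of two states, checked for a finite list of names."""
    if (d1(BOTTOM, "where", s1, BOTTOM, BOTTOM, BOTTOM, BOTTOM)
            != d2(BOTTOM, "where", s2, BOTTOM, BOTTOM, BOTTOM, BOTTOM)):
        return False
    return all(
        d1(BOTTOM, "examine", s1, BOTTOM, BOTTOM, n, BOTTOM)
        == d2(BOTTOM, "examine", s2, BOTTOM, BOTTOM, n, BOTTOM)
        for n in names
    )


def d_source(dstack, command, state, program, bpts, name, value):
    """Hypothetical debugger for source states (label, bindings)."""
    label, bindings = state
    if command == "where":
        return label
    if command == "examine":
        return bindings.get(name)
    raise ValueError(command)


def d_target(dstack, command, state, program, bpts, name, value):
    """Hypothetical debugger for a target in which i1 was renamed to i2."""
    if command == "examine" and name == "i1":
        name = "i2"
    return d_source(dstack, command, state, program, bpts, name, value)


source_state = (7, {"i1": 10, "k": 3})
target_state = (7, {"i2": 10, "k": 3})
names = ["i1", "k"]  # names that appear in the source program
print(consistent(source_state, target_state, d_source, d_target, names))  # True
print(consistent(source_state, (8, {"i2": 10, "k": 3}),
                 d_source, d_target, names))                              # False
```

The two states hold different bindings, yet they are consistent because no where or examine request on a source name can tell them apart.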

Now we can use the consistency of states to define congruency of debuggers.


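Definition 2-2  Two debuggers $D_1$ and $D_2$ are congruent for programs $p_1$ and $p_2$
(congruent($D_1$, $D_2$, $p_1$, $p_2$)) if:

$$
\begin{aligned}
\forall (s_1, s_2)\;\; & \mathrm{consistent}(s_1, s_2) \rightarrow \\
& \forall b \;\; \mathrm{consistent}\bigl(D_1(\perp, \mathit{run}, s_1, p_1, b, \perp, \perp),\; D_2(\perp, \mathit{run}, s_2, p_2, b, \perp, \perp)\bigr) \\
& \text{and} \\
& \forall (n, v) \;\; \mathrm{consistent}\bigl(D_1(\perp, \mathit{set}, s_1, p_1, \perp, n, v),\; D_2(\perp, \mathit{set}, s_2, p_2, \perp, n, v)\bigr)
\end{aligned}
$$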


We define two debuggers to be congruent for two particular programs if execution of
identical commands on consistent states yields a new set of consistent states. The two
commands for modifying states are run and set. For the run command, we use $\forall$
because the condition must be true for every possible set of breakpoints. For the set
command, we use $\forall$ because the condition must be true for every possible combination
of variable names and values.
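
Congruency can be illustrated with a bounded check in which every quantifier ranges over a small finite sample; this sampling, the toy program encoding (a list of (variable, function) assignments with list indexes as labels) and the helper `make_debugger` are all assumptions of the sketch, so a result of True is evidence on the sample only, not a proof.

```python
BOTTOM = None


def make_debugger(rename):
    """Build a toy D function; rename maps source names to target names
    (empty for a debugger whose source and target coincide)."""
    def d(dstack, command, state, program, bpts, name, value):
        label, env = state
        name = rename.get(name, name)
        if command == "where":
            return label
        if command == "examine":
            return env.get(name)
        if command == "set":
            return (label, {**env, name: value})
        if command == "run":
            env = dict(env)
            while label < len(program):
                var, fn = program[label]
                env[var] = fn(env)
                label += 1
                if label in bpts:
                    break
            return (label, env)
        raise ValueError(command)
    return d


def consistent(s1, s2, d1, d2, names):
    same_place = (d1(BOTTOM, "where", s1, BOTTOM, BOTTOM, BOTTOM, BOTTOM)
                  == d2(BOTTOM, "where", s2, BOTTOM, BOTTOM, BOTTOM, BOTTOM))
    return same_place and all(
        d1(BOTTOM, "examine", s1, BOTTOM, BOTTOM, n, BOTTOM)
        == d2(BOTTOM, "examine", s2, BOTTOM, BOTTOM, n, BOTTOM)
        for n in names)


def congruent(d1, d2, p1, p2, state_pairs, bpt_lists, names, values):
    """Congruency with every quantifier restricted to a finite sample."""
    for s1, s2 in state_pairs:
        if not consistent(s1, s2, d1, d2, names):
            continue  # the implication holds vacuously for this pair
        for b in bpt_lists:
            if not consistent(d1(BOTTOM, "run", s1, p1, b, BOTTOM, BOTTOM),
                              d2(BOTTOM, "run", s2, p2, b, BOTTOM, BOTTOM),
                              d1, d2, names):
                return False
        for n in names:
            for v in values:
                if not consistent(d1(BOTTOM, "set", s1, p1, BOTTOM, n, v),
                                  d2(BOTTOM, "set", s2, p2, BOTTOM, n, v),
                                  d1, d2, names):
                    return False
    return True


source = [("i1", lambda e: 1), ("k", lambda e: e["i1"] + 1)]
target = [("i2", lambda e: 1), ("k", lambda e: e["i2"] + 1)]  # i1 renamed
base = make_debugger({})
renaming = make_debugger({"i1": "i2"})
pairs = [((0, {}), (0, {})), ((1, {"i2": 5}), (1, {"i1": 5}))]
sample = dict(state_pairs=pairs, bpt_lists=[[1], [2], []],
              names=["i1", "k"], values=[0, 9])
print(congruent(renaming, base, target, source, **sample))  # True
print(congruent(base, base, target, source, **sample))      # False
```

The second call uses a debugger that ignores the renaming, so after running to the first breakpoint the two sides disagree on the value of `i1` and the check fails.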

Now that we have a way of comparing the behavior of debuggers, we can define correctness
for a source-level debugger.


Definition 2-3  A debugger function D is a correct source-level debugger for a transformation A when:

$$\forall p \quad \mathrm{congruent}(D, \mathrm{basedebugger}, A(p), p)$$

where basedebugger is the debugger defined in FIGURE 2-1. A transformation is a
function that inputs one program and outputs another program. The two programs have
the same meaning, but are in different representations (high-level language and machine
code). In this definition, p is the source program and A(p) is the target program.


Intuitively, a source-level debugger is correct if its behavior while executing the target
program is the same as the behavior of the base debugger while executing the source
program. In the rest of the thesis, when we use the words correct and incorrect to
describe the behavior of a debugger, we are comparing it to the behavior of the base
debugger. Correct behavior has also been called expected behavior by Zellweger
[Zellweger 84].

Some transformations change the behavior of the program so much that there cannot be
a D function. This problem exists for debuggers for sequential machines as well as for
parallel ones. For example, it might not be practical to allow the user to modify variables
that are part of a common sub-expression [Hennessy 82]. Examining a variable
may not be possible if a dead store (a dead store is a store to a variable that is not used
later in the program) that modifies the variable has been eliminated, and single stepping
may not be possible if an optimization removes an "unnecessary" loop. An example
from parallelization is when a compiler transforms a sequential algorithm that computes
the sum of an array into a parallel one which does the operations in a different order and
doesn't compute the same intermediate values.
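
The array-sum case can be made concrete with a small sketch. The cyclic split of the array over processors and the function names are assumptions chosen for illustration; the point is only that the accumulator values a user would expect to see while stepping through the sequential loop need not exist anywhere in the parallel execution.

```python
def sequential_partial_sums(a):
    """Values taken by the accumulator of the sequential loop."""
    total, seen = 0, []
    for x in a:
        total += x
        seen.append(total)
    return seen


def parallel_partial_sums(a, nprocs):
    """Accumulator values when each processor sums a cyclic share of the
    array and the per-processor totals are combined afterwards."""
    seen, totals = [], []
    for p in range(nprocs):            # stand-in for nprocs processes
        local = 0
        for x in a[p::nprocs]:
            local += x
            seen.append(local)
        totals.append(local)
    grand = 0
    for t in totals:
        grand += t
        seen.append(grand)
    return seen


a = [1, 2, 3, 4, 5, 6]
seq = sequential_partial_sums(a)
par = parallel_partial_sums(a, 2)
print(seq[-1] == par[-1])                # True: same final result
print(sorted(set(seq) - set(par)))       # [3, 10, 15]
```

The final sum agrees, but the sequential intermediate values 3, 10 and 15 are never computed by the parallel version, so a debugger cannot report them truthfully at the corresponding source locations.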

Just because one aspect of the program is not observable by the debugger does not mean
that a debugger cannot be used at all. If a variable cannot be examined because a dead
store was eliminated, then the debugger should not allow the user to inspect the variable;
inspecting other variables should not be affected. Furthermore, the debugger
should never give misleading information. The debugger should either behave the same
as a correct source-level debugger or it should indicate to the user that correct behavior
is not possible. Zellweger calls this truthful behavior. We call transformations that have
a correct source-level debugger fully debuggable and transformations that do not partially
debuggable.

2.3.3  Extending the debugger definition to parallel programs

Only a small number of changes are needed to extend the notation from the previous
sections to cover parallel machines as well. The state for a parallel program is a set of
states, one for each processor. Each state contains the current point in the program for
the processor and the value of that processor's variables. A program is a set of single
processor programs. A D function for a parallel program is a set of D functions for single
processors. When a debugger for a parallel program must perform an action on a
processor, it selects the debugger and state for that processor from the sets. For example,
if the debugger wants to examine the value of a variable in processor 3, it extracts the
state of the third processor from the set of states, extracts the debugger for processor 3
from the set of debuggers, and applies the debugger to that state to look up the value of
the variable requested. If the debugging action modifies a state, as is done by the set
command, the resulting state is merged back in to the set of states that is considered the
current state.
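
The select, apply and merge steps described above can be sketched as follows. Representing the sets of states and debuggers as dictionaries keyed by processor index is an assumption of this sketch, as are the helper names; `BOTTOM` stands in for the dummy argument.

```python
BOTTOM = None


def single_processor_debugger(dstack, command, state, program, bpts,
                              name, value):
    """Hypothetical per-processor D function on (label, bindings) states."""
    label, env = state
    if command == "where":
        return label
    if command == "examine":
        return env.get(name)
    if command == "set":
        return (label, {**env, name: value})
    raise ValueError(command)


def parallel_examine(debuggers, states, proc, name):
    """Select the debugger and state of one processor and apply them."""
    d, s = debuggers[proc], states[proc]
    return d(BOTTOM, "examine", s, BOTTOM, BOTTOM, name, BOTTOM)


def parallel_set(debuggers, states, proc, name, value):
    """Apply set on one processor and merge the resulting state back in
    to obtain the new current set of states."""
    d, s = debuggers[proc], states[proc]
    new_state = d(BOTTOM, "set", s, BOTTOM, BOTTOM, name, value)
    merged = dict(states)
    merged[proc] = new_state
    return merged


procs = range(4)
debuggers = {p: single_processor_debugger for p in procs}
states = {p: (0, {"x": 10 * p}) for p in procs}
print(parallel_examine(debuggers, states, 3, "x"))    # 30
states2 = parallel_set(debuggers, states, 3, "x", -1)
print(parallel_examine(debuggers, states2, 3, "x"))   # -1
print(parallel_examine(debuggers, states2, 2, "x"))   # 20
```

Only the selected processor's state changes; the states of the other processors are carried over unchanged into the merged set.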




2.3.4  Composing D functions

We construct a compiler by composing code transformations; we also need a way to
compose the D functions for those transformations to construct a debugger. If the compiler
translates the code from source to target in several steps, we want to build a debugger
for each step individually, then construct the debugger for the entire transformation
by putting together the debuggers for those steps. For example, some C++ compilers
first translate the C++ code to C code, then translate the C code to assembly code. We
could build a debugger for such a system by building a debugger for C code, building
another debugger that assumes the source level is C++ and the target level is C, and then
putting together those two debuggers to construct a new one where the source level is C++
and the target level is machine code.

A stack of debuggers comprises individual debuggers that translate requests from their
source level to their target level. Requests are passed down the stack to the machine
level and results are passed back to the top. One debugger is at a lower level than
another if it is closer to the machine level in the debugger stack. The lowest level debugger
is called the base debugger, which directly executes commands.

Each debugger, except the base debugger, assumes that it can use a lower level debugger
to manipulate the target state. In effect, a debugger just translates commands to manipulate
source-level objects into commands to manipulate target-level objects.

For example, if a program transformation replaces all occurrences of the variable i1
with i2, then the debugger for that transformation translates requests to examine the
variable i1 to a request to examine the variable i2 and passes on all other requests
unchanged. The same applies to modifying variables. The D function for the
renaming transformation is called d_i1_i2 and can be found below. The function
first selects the first D function on the debugger stack. The function rest returns
the debugger stack with the first D function removed. The function apply applies a
function to a set of arguments. It has the same semantics as the Common Lisp [Steele
84] function of the same name. If the variable foo contains the function bar then
apply(foo,1,2,3) is the same as bar(1,2,3).

The first argument of d_i1_i2, dstack, is the debugger stack. The debugger stack is
a list of D functions. The top of the stack is the D function that the current D function
can use to manipulate its target state. Each D function peels the top of the stack when
calling the lower level D function. The new top of stack for the lower level debugger is
the D function below it in the stack. When a debugger calls a lower level debugger, it
applies the lower level debugger to a set of arguments. The first argument passed to the D
function is the new debugger stack with the first debugger removed.

The lowest level debugger on the stack (the base debugger) must perform the commands
on the real target state; it does not pass on the command to another debugger.




if (command == run) {
    return apply(first(dstack), rest(dstack),
                 run, state, program, bpts, ⊥, ⊥);
}

if (command == where) {
    return apply(first(dstack), rest(dstack),
                 where, state, ⊥, ⊥, ⊥, ⊥);
}

if (command == examine) {
    if (name == "i1")
        return apply(first(dstack), rest(dstack),
                     examine, state, ⊥, ⊥, "i2", ⊥);
    else return apply(first(dstack), rest(dstack),
                      examine, state, ⊥, ⊥, name, ⊥);
}

if (command == set) {
    if (name == "i1")
        return apply(first(dstack), rest(dstack),
                     set, state, ⊥, ⊥, "i2", value);
    else return apply(first(dstack), rest(dstack),
                      set, state, ⊥, ⊥, name, value);
}


As an illustration, assume we compose the debugger defined above and a similar
debugger for a transformation that renames the variable i2 to i3. The base
debugger is called b. The debugger stack for this set of transformations is (d_i1_i2,
d_i2_i3, b). If the user wants to examine the variable i1, then the debugger would be
applied from the top of the stack as follows:

d_i1_i2((d_i2_i3, b), examine, state, ⊥, ⊥, "i1", ⊥)

The debugger d_i1_i2 would pass the request to the next debugger on the stack as
follows:

d_i2_i3((b), examine, state, ⊥, ⊥, "i2", ⊥)

The debugger d_i2_i3 would pass the request to the base debugger:

b((), examine, state, ⊥, ⊥, "i3", ⊥)

The base debugger would look up the value of i3 and pass it back up to d_i2_i3,
which would return it to d_i1_i2, which would return the value that is displayed to
the user.
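The composition of renaming debuggers can be tried out with a small sketch. The Python below is only an illustration under stated assumptions: the target state is modeled as a dictionary, ⊥ is modeled as None, the helper make_rename_debugger is a hypothetical name, and only the examine and set commands are handled by the base debugger.

```python
BOTTOM = None  # stands in for the undefined argument (written ⊥ in the text)


def first(dstack):
    """Return the D function on top of the debugger stack."""
    return dstack[0]


def rest(dstack):
    """Return the debugger stack with the top D function removed."""
    return dstack[1:]


def make_rename_debugger(source_name, target_name):
    """Build a D function for a transformation that renames one variable.

    Hypothetical helper: the thesis writes each D function out by hand.
    """
    def d(dstack, command, state, program, bpts, name, value):
        if command in ("examine", "set") and name == source_name:
            name = target_name  # translate the source-level name
        # Every request, translated or not, goes to the next debugger down.
        return first(dstack)(rest(dstack), command, state, program, bpts, name, value)
    return d


def base_debugger(dstack, command, state, program, bpts, name, value):
    """Base debugger: acts directly on the target state (a dict in this sketch)."""
    if command == "examine":
        return state[name]
    if command == "set":
        state[name] = value
        return value
    raise NotImplementedError(command)  # run and where are left out of the sketch


d_i1_i2 = make_rename_debugger("i1", "i2")
d_i2_i3 = make_rename_debugger("i2", "i3")

target_state = {"i3": 7, "j": 5}
stack = (d_i2_i3, base_debugger)
print(d_i1_i2(stack, "examine", target_state, BOTTOM, BOTTOM, "i1", BOTTOM))
```

Examining i1 at the top of the stack reaches the base debugger as a request for i3, mirroring the three calls shown above.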




2.4  Summary 


The models of execution for parallel and sequential machines were introduced. The
structure of a parallelizing compiler has been described. A notation for specifying
debuggers has been introduced and it was used to define the correct behavior for a
source-level debugger.




CHAPTER  3 


Approach 


In this chapter, we describe our approach for source level debugging. We focus on the
parallelization phase of compilation, since it spans the gap between sequential and
parallel programs. Parallelization requires both structural mapping and dynamic order
restoration. Dynamic order restoration is needed because parallelization converts the
program's single thread of control to multiple threads of control, hence out of order
execution is possible. Structural mapping is needed because the mechanics of
parallelization require that the program structure be altered.

We attack these problems separately by dividing the parallelization phase of
compilation into two parts. In the first phase, we restructure the code to expose all the
structural changes necessary for parallelism. In the second phase we parallelize the code by
extracting multiple programs from a single processor program.

In Section 3.1, we define the scope of the problem we solve. In Section 3.2, we review
the basic functionality of what a debugger for parallelized programs should do. This is
followed in Section 3.3 by an introduction to our methodology for constructing
debuggers.


3.1 Problem scope

If compilation parallelizes the program but does not change the computation or the
order of execution of operations on a processor, then the same values are computed in
source program order and it is possible to build a debugger. If compilation alters the
computation (e.g. converting a sequential reduction into a parallel reduction) or changes
the order of execution of operations, then it may not be possible to construct a debugger,
depending on the degree to which the computation has been changed. In this thesis, we
only study the problem of debugging when the computation has not been changed.




As we described in the previous chapter, compilation can be divided into three phases:
restructuring, parallelization, and single processor compilation. A debugger for the
compiler can be thought of as the composition of debuggers for the individual phases.
Restructuring transformations and single processor compilation are sequential to
sequential transformations and can change the computation and the order of operations.
Parallelization converts a program from sequential to parallel but does not change the
computation.

n#  tiwMW  pli— f  fa  (0  ||)e  debugger 

writer. Sequential to sequential transformations, whether they are applied by a
restructurer or an optimizing compiler for a sequential language, have received a great deal of
attention [Feiler 82][rMnea 78][Zellweger 84][Coutant 88][Brooks 92][Copperman
90][Hennessy 82].

In this thesis, we only study the debugging problems associated with the transition from
sequential to parallel, which occurs during parallelization [Pineo 91][Cohn 91].
Parallelization does not change the computations of a source program or the order in which
they are computed. To factor out the problems associated with debugging sequential to
sequential transformations, we assume that there is either a debugger for the
restructuring phase, or that no restructuring is done by the compiler. Furthermore, we assume that
there is a debugger for the single processor compiler.


3.2 What a debugger for parallelization must do

The work of the D function can be divided into dynamic order restoration and structural
mapping. In Section 3.2.1, we describe when and why dynamic order restoration is
necessary and what it must do. In Section 3.2.2 we answer the same questions for structural
mapping.

3.2.1 Dynamic order restoration

Dynamic order restoration makes it appear that the order of execution of operations is
the same as the order in the source program. It is necessary whenever the compiler
removes an ordering constraint that was present in the sequential program. Compilers
remove ordering constraints to exploit parallelism. For example, the semantics of
most imperative languages imply that iterations of a loop are executed sequentially. If
there are no dependences between loop iterations, then the compiler can generate
parallel code that does not obey that ordering constraint and allows iterations to be executed
in parallel.

A problem for debugging arises when we allow operations to be executed in parallel. If
the program execution is interrupted because the program hit a breakpoint or the user
aborted the program, it may appear that operations are executed out of source order. Out
of order execution makes it impossible for the debugger to point to a position in the
source program where every operation before that point has been completed and no
operations after that point have been initiated. FIGURE 3-1 illustrates this point. The
top of the figure is the same sequential and parallel programs as in the example used in
the previous chapter. The bottom of the figure contains traces of the execution of the
sequential program in the center column and traces of processor 0 and processor 1 to the




FIGURE 3-1  Parallel execution gives the appearance of out of order execution

Sequential source program

int i;
int A[30];

for (i = 0; i < 30; i++) {
    A[i] = 1;
}

Processor 0

int i1;
int A1[15];

for (i1 = 0; i1 < 15; i1++) {
    A1[i1] = 1;
}

Processor 1

int i2;
int A2[15];

for (i2 = 0; i2 < 15; i2++) {
    A2[i2] = 1;
}

[Figure: execution traces of processor 0 (left), the sequential program (center), and
processor 1 (right). Arrows relate corresponding operations; a box marks the operation
that each processor will execute next.]


left and right. The arrows relate operations in the sequential and parallel programs. The
boxes around statements in the processor 0 and processor 1 traces indicate the
operations that each processor will execute next. For this particular state of the parallel
program, there is no place in the sequential program trace that we can point to and say that
every operation before that point has been executed and no operation after that point has
been executed.

Given a point considered the current point in the source execution, operations which
occur after this point in the source program and have already been executed are called
early, and operations which appear before the current point in the sequential execution
and have not been executed yet are called late.

FIGURE 3-2  Early and late operations

[Figure: traces of processor 0 (left, using i0 and A0), the sequential program (center),
and processor 1 (right, using i1 and A1) for the same program state. A box in each
processor trace marks the operation that processor will execute next, and a box in the
sequential trace marks the current statement. Arrows between the traces are marked
E (early) or L (late).]


In FIGURE 3-2 we have another illustration of the same state of the program. The box
around the statement in the sequential program trace indicates the statement that the
debugger has chosen as the current statement (calling it the current statement implies
that it has not been executed yet). The debugger could have selected other statements as
the current one; the constraints on choices are explained in CHAPTER 4. Anything
above the selected line in the sequential trace which has not been executed is a late
operation. The arrows for those operations are marked with L's. Anything below the selected
line in the sequential trace which has been executed is an early operation. The arrows
for those operations are marked with E's.
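The definitions of early and late operations can be stated as a short computation. The Python sketch below is an illustration only: it assumes the sequential trace is given as a list of flags recording which operations the parallel program has already executed, and the function name classify is hypothetical.

```python
def classify(executed, current):
    """Split out-of-order operations into (early, late) lists of trace indices.

    executed -- one boolean per operation of the sequential trace, in source order
    current  -- index of the operation chosen as the current point
    """
    if executed[current]:
        raise ValueError("the current operation must not have been executed yet")
    # Late: before the current point but not executed yet.
    late = [k for k in range(current) if not executed[k]]
    # Early: after the current point but already executed.
    early = [k for k in range(current + 1, len(executed)) if executed[k]]
    return early, late


# Simplified state: only the assignments A[0]=1 .. A[3]=1 are traced.
# A[1] and A[3] have been executed; A[0] and A[2] have not.
early, late = classify([False, True, False, True], current=2)
print(early, late)  # [3] [0]
```

With A[2] = 1 chosen as the current point, the assignment to A[3] is early and the assignment to A[0] is late. A different choice of current point gives different sets.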

If an early or late operation changes the value of a variable, and there is no dynamic
order restoration, then the behavior of the debugger will be incorrect. For an early
operation, the user should see the value before it was updated. However, only the value after
the update is available. In the figure, processor 1 does an early update to the array A1. If
the user inspected A1[1], they would see the new value, 1, when they should have seen
the old value, 0.

If there is a late operation that modifies a variable, the user should see the value after the
late operation updates the variable, which has not happened yet. In the figure, processor
0 has a late operation that updates the array A0. When inspecting A0[0], the user sees the
pre-update value of 0 when the post-update value of 1 should be seen.

Dynamic order restoration also ensures that the user sees events such as breakpoints in
the order that they would have occurred in the source program. If a late operation will
cause a breakpoint when it executes, then the user should see the breakpoint associated
with the late operation before seeing the event that made the program stop in its current
state, because the late operation has an earlier virtual time.

Dynamic order restoration can either force the parallel program to execute operations in
source order or it can allow the program to execute operations out of order, but not let
the user examine any state that has been modified by out of order operations. Choosing
between these two methods depends on the situation; methods are discussed in
CHAPTER 4. If the debugger doesn't force execution in source order, each command must
have a set of restrictions to ensure that the user cannot see the effect of out of order
execution. For the where command, the debugger must pick a point in the program to call
the current point. This determines the set of late and early operations. The debugger
must be sure that there are no late operations that will cause breakpoints or exceptions.
For the examine command, the debugger must ensure that the user does not inspect a
variable that is written by a late or early operation. For the set command the debugger
must ensure that the user does not modify the value of a variable that is read or written
by a late or early operation. When setting a breakpoint the debugger should ensure that a
breakpoint is not set for an early operation.
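The restrictions on the examine and set commands amount to simple checks against the read and write sets of the out-of-order operations. The following Python sketch shows one possible form of those checks; the dictionary representation of an operation and the function names can_examine and can_set are assumptions made for illustration.

```python
def can_examine(var, out_of_order_ops):
    """A variable may be shown only if no early or late operation writes it."""
    return all(var not in op["writes"] for op in out_of_order_ops)


def can_set(var, out_of_order_ops):
    """A variable may be modified only if no early or late operation reads or writes it."""
    return all(var not in op["reads"] and var not in op["writes"]
               for op in out_of_order_ops)


# The state discussed above, reduced to the two out-of-order assignments.
pending = [
    {"kind": "late", "reads": {"i0"}, "writes": {"A0[0]"}},
    {"kind": "early", "reads": {"i1"}, "writes": {"A1[1]"}},
]
print(can_examine("A1[1]", pending))  # False: written by the early operation
print(can_examine("A1[0]", pending))  # True: untouched by out-of-order operations
print(can_set("i1", pending))         # False: read by the early operation
```

The set check is stricter than the examine check because changing a value that an early operation has already read would invalidate work that has been done.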

Late and early operations are closely connected to roll-back and roll-forward variables
as defined by Hennessy [Hennessy 82]. Variables written by early operations are
roll-back variables. Variables written by late operations are roll-forward variables.
Computing the sets of roll-forward and roll-back variables is sufficient if the debugger wants to
determine if it is safe to examine a variable, but more information is needed if we want
to determine if it is safe to modify variables or set breakpoints. Computing the late and
early operations provides this extra information.

3.2.2 Structural mapping

Structural mapping relates variables and statements in the source code to variables and
statements in the target code. If the program is translated, then structural mapping is
needed.

Parallelization usually entails some change in the structure of the program. If the
compiler implements a shared memory model on a distributed memory machine, then the
names for variables used in the source program are not the same as the names used in
the parallel program. In our example in FIGURE 3-1, the variable i has been renamed.
If the elements of an array are distributed across a set of processors, as is done for the
array A in the example, then the subscripts used to reference the array are changed.




Some variables like loop counters are replicated on all processors, even on shared
memory machines. Furthermore, some variables might not have the same value as they
would have had in the sequential program. In the example, the loops for both processors
go from 0 to 14 while the loop in the source program goes from 0 to 29.

The structure of the code is changed as well. Instead of a single program, there is a set of
programs, one for each processor. If the compiler converts a sequential loop into a
parallel loop, then the loop structure must be changed so that every processor executes a
subset of the iterations.

Structural mapping is a translation of names between objects in the source program and
target programs. When the programmer uses the debugger to examine or modify a
variable, structural mapping converts the name that is used in the source program to the
name in the target program. If that variable happens to be an element of a distributed
array, then structural mapping computes which processor the element resides on and the
address at which it is stored. In the example of FIGURE 3-1, structural mapping would
determine that array element A[i] resides on processor 0 if i is even and processor 1 if i
is odd. It would also change the index [i] to [i/2]. If a variable is replicated, then the
debugger must know which copy is the appropriate one to examine and which set of
copies is the appropriate one to modify. If a variable does not have the same value as in
the source program, then the debugger must know how to compute the source value
from the target value to examine the variable and it must also know how to compute the
target value from the source value to modify the variable. To examine the loop counter
in the example, we must first decide which processor has the appropriate copy. If it is
processor 0, we multiply the loop counter by 2. If it is processor 1, we multiply the loop
counter by 2 and add 1. For the where command, we must know how to map lines in
the parallel program to lines in the sequential program. To set breakpoints with the run
command, we must know the inverse mapping, from lines in the sequential program to
lines in the parallel program.
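The index arithmetic in this example is small enough to write out. The Python sketch below is an illustration of the mapping for the two-processor example only; the function names and the NUM_PROCS constant are hypothetical, and a real distribution would supply its own formulas.

```python
NUM_PROCS = 2  # the example distributes A across two processors


def locate_element(i):
    """Map source element A[i] to (processor, local index)."""
    return i % NUM_PROCS, i // NUM_PROCS


def source_loop_counter(processor, local_counter):
    """Compute the source value of the loop counter from a processor's copy."""
    return NUM_PROCS * local_counter + processor


def target_loop_counter(i):
    """Inverse mapping: the local counter value that corresponds to source value i."""
    return i // NUM_PROCS


print(locate_element(7))          # (1, 3): A[7] is element 3 on processor 1
print(source_loop_counter(1, 3))  # 7
```

Examining a variable uses the target-to-source direction, while modifying one uses the source-to-target direction, so a debugger needs both.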


3.3  Debugger  methodology 


The debugger methodology presented in this thesis allows us to separate dynamic order
restoration from structural mapping when building a debugger. As is explained later,
dividing the two allows us to study the dynamic order restoration problem independent
of the parallelizing transformation used by a compiler.

A debugger that does dynamic order restoration must know the total ordering of
operations in the parallel program. Structural mapping requires information about the
relationship of lines and variables in the source and target programs. We can separate
structural mapping from dynamic order restoration in a debugger by dividing the
parallelization phase of compilation into two steps: distribution followed by thread splitting.
The input to distribution is the same as the input to parallelization and is called the α
program. The output of distribution is another sequential program called the β program.
Thread splitting inputs the β program and outputs the parallel ω program, where ω_i is
the sub-program that executes on processor i. The ω program is the output of
parallelization.




The operations in the β program have a 1 to 1 relationship with the operations of the ω
program. Since the β program is sequential, it defines a total ordering of execution of
operations of the ω program. The debugger uses the total ordering for dynamic order
restoration. All of the structural effects of parallelization have been factored out and
exposed in a program without parallelism (the β program). Structural effects include
replicated statements and variables, altered loop structures, and distributed variables.

The debugger is structured in a manner similar to the compiler. There is a debugger for
distribution and thread splitting; the debugger for parallelization is the composition of
the two. The debugger for distribution does structural mapping but does not need to do
dynamic order restoration because both the input and output are sequential. The
debugger for thread splitting does dynamic order restoration and some simple structural
mapping.

A debugger where structural mapping and dynamic order restoration are separated has
two advantages when compared to a debugger where they are done together. The chief
advantage is that it lets us isolate the functionality that manages program parallelism in
a part of the debugger that is reusable for different compilers. When we construct new
parallelizing transformations, a new distribution transformation is needed, but the
thread splitting transformation always stays the same. Thus, the part of the debugger
that handles thread splitting can be reused when the debugger is retargeted for a
compiler that uses a different parallelizing transformation.

The second advantage is that structural mapping is simplified when dynamic order
restoration has already been done. This is because all problems associated with out of order
execution have been resolved. To illustrate this point we use the previous example of
parallelization where processor 0 executes all the even iterations of a loop and processor
1 executes all the odd iterations. In FIGURE 3-3, the top program is the α program, the
middle program is the β program, and the bottom program is the ω program. We explain
β programs in detail in Section 4.1.1; for now it is only necessary to observe the
duplication of the loop counters. Assume that the parallel program is interrupted, and both
processor 0 and processor 1 are about to execute the assignment statement in the body of
the loop. If the user were to ask to inspect the loop counter, it is unclear which counter is
the appropriate one to show to the user by inspecting the state of the ω program.
However, if we do dynamic order restoration first, we can relate the current state of execution
to a point in the execution of a sequential program, which is our β program. In this case,
it would indicate that the next statement to be executed is the assignment to A1 in the
body of the loop in the β program, which makes it clear that the correct copy of the
loop counter to inspect is i1. Doing dynamic order restoration first, in conjunction with
the 1 to 1 mapping between statements in the β and ω programs, has simplified
structural mapping.

The main disadvantage of separating structural mapping and dynamic order restoration
is that we must construct the intermediate program called the β program, which would
not otherwise be necessary. However, we do not believe that this disadvantage negates
the advantages; we show in CHAPTER 5 that constructing a β program for the
common distributions is straightforward.




FIGURE 3-3  α, β, and ω program

α program

int i;
int A[30];

for (i = 0; i < 30; i++) {
    A[i] = 1;
}

β program

int i0, i1;
int A0[15], A1[15];
for (i0 = 0, i1 = 1; i0 < 30; i0 += 2, i1 += 2) {
    A0[i0] = 1;
    A1[i1] = 1;
}

ω program

ω_0

int i0;
int A0[15];

for (i0 = 0; i0 < 30; i0 += 2) {
    A0[i0] = 1;
}

ω_1

int i1;
int A1[15];

for (i1 = 0; i1 < 30; i1 += 2) {
    A1[i1] = 1;
}

The next two chapters complete the description of debugger methodology. Chapter 4
describes the thread splitting transformation and its debugger and Chapter 5 describes
distribution and its debugger.

3.4 Summary

We identified the scope of the problem we are trying to solve, which is the
parallelization phase of compilation. Dynamic order restoration and structural mapping are needed
for debugging parallelization. We described a methodology for factoring out dynamic
order restoration from parallelization, so that it can be handled separately from
structural mapping.




CHAPTER  4 


The  thread  splitting 
transformation 


This chapter describes the thread splitting transformation and its debugger. The thread
splitting transformation converts a sequential program into a parallel program. Its
source is the β program and the target is the ω program. All of the changes which are
necessary for distribution are exposed in the β program. These changes include the
replication and distribution of variables and the restructuring of loops. In addition, every
statement in the β program is labeled with the processor on which the statement should
be executed. Thread splitting uses the labeling to generate a parallel program.

The β program has an execution model of a single thread of control executing in
multiple, disjoint address spaces. In generating the ω program, we extract one thread of
control for each address space. There is a one to one mapping between statements in the β
program and statements in the ω program. The order of execution of statements in the β
program defines a total ordering of operations in the ω program, which is used for
dynamic order restoration.

We first define a debugger which forces the execution order of statements in the parallel
program to be in the same order as in the β program. This is a brute force method of
doing order restoration. We then define a debugger that allows the parallel program to
execute unrestricted. Depending on the state that the program happens to stop in, the
debugger may not be able to allow the user to examine or modify some variables. The
debugger must ensure that it provides truthful behavior; it cannot supply any
information to the user that contradicts the behavior of a correct debugger.

We describe the thread splitting transformation and the language of β programs in
Section 4.1. In Section 4.2, we define the debugger that executes statements in the same
order as the β program. We describe how to allow parallel execution in Section 4.3. In
Section 4.4, we discuss efficiency issues related to dynamic order restoration. Finally,
we describe some work related to the issues discussed in this chapter in Section 4.5.




4.1 The thread splitting transformation

Thread splitting is a simple transformation where we extract multiple programs from a
single program. Each statement of the source program is marked with a processor index
label which indicates which processor should execute that statement. The semantics are
chosen so that after thread splitting, we have a parallel program with the same meaning
as the source program.
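The core idea of extracting one program per processor from labeled statements can be sketched in a few lines. The Python below is a deliberately simplified illustration, not the code generation scheme of the thesis: it assumes a straight-line program, a one-dimensional processor array, and a hypothetical representation in which each statement is a pair of processor labels and statement text. Nested statements and communication are ignored.

```python
def thread_split(labeled_program):
    """Extract one statement list per processor from a labeled sequential program.

    labeled_program -- list of (labels, statement) pairs in sequential order
    Returns {processor: [(position, statement), ...]}; the position records where
    the statement sits in the sequential order.
    """
    per_processor = {}
    for position, (labels, statement) in enumerate(labeled_program):
        for p in labels:
            per_processor.setdefault(p, []).append((position, statement))
    return per_processor


program = [
    ((1, 2), "int c;"),
    ((1,), "c = 2;"),
    ((2,), "c = 4;"),
]
split = thread_split(program)
print(split[1])  # [(0, 'int c;'), (1, 'c = 2;')]
print(split[2])  # [(0, 'int c;'), (2, 'c = 4;')]
```

Keeping the sequential position alongside each extracted statement preserves the ordering information that dynamic order restoration relies on.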

We first define the language of β programs and its semantics in Section 4.1.1. We then
describe how code generation works for thread splitting in Section 4.1.2 and conclude
with an example in Section 4.1.3.

4.1.1 The language for β programs and its semantics

The model of execution for a β program is that of a single thread of control executing in
multiple address spaces, SIMD style. Each statement is annotated with the address
spaces that the current statement should operate on.

The language for β programs is the language of source programs augmented with
additional constructs for communication and the specification of the mapping of statements
to processors. Its semantics are the same as the source program with some exceptions
noted below. In the thesis, the source language is sequential so the semantics of the
language of β programs are sequential as well.

4.1.1.1 Processor index labels

All statements have at least one processor index label (PIL). The value of the PIL's
indicates which processors execute the statement in the parallel program. A PIL is an n
element vector for an n dimensional processor array and must be a compile-time constant.
If the value of a PIL is not a valid processor name (out of range, non-integer), then the
PIL is invalid. Invalid PIL's can only occur if the compiler is incorrect.

A single PIL is written as a comma separated list of integers enclosed in angle brackets
(< >). The length of a PIL is the same as the dimensionality of the processor array. If
there is more than one PIL for a statement they are separated by commas. Some examples
of statements with PIL's can be found below:

<1>,<2>,<3> b = 1;

<2>,<4> c = 2;

There  is  one  distinct  address  space  for  every  valid  value  of  a  PIL  Variable  declaration 
statements  create  one  copy  of  a  variable  Cor  every  P!L  for  that  statement  The  PIL’s 
determine  which  address  space  those  variables  reside  in.  Other  statements  such  as 
assignments,  ccaditionals,  etc.,  always  reference  the  copy  of  the  variaUe  that  is  in  their 
namespace,  which  is  determined  by  the  PIL  for  that  statement  If  the  statement  has  more 
than  one  PEL,  side  effects  of  the  execution  of  the  statement  affect  every  address  space 
named  by  a  PEL.  Assignment  statements  have  straightforward  semantics  for  multiple 
PIL’s.  Fora  statement  like  the  foifowing,  the  effect  is  to  set  both  the  cqpy  of  c  in  <2> 
and  ttae  copy  of  e  in  <4>  to  the  value  2. 


32 


IKMIRCE-UVELOCBUGQMaOPAUrOlUTICAU^  PARALLiLQED  PfWORAMS 


Tbs  thraad  tramformaUon 


<2>,<4>  c  s  2; 

For statements with control flow, such as loops and conditionals, any side effects in the
loop header or condition affect every address space named by a PIL. For example, in the
statement below the side effects of the statement are to initialize the variable i to 0 on
loop entry and to increment the loop counter for every iteration of the loop. The
semantics of the statement are to initialize the copies of i in address spaces 0 and 1 to 0
on loop entry and to increment both copies of the loop counter for every iteration.

    <0>,<1> for (i = 0; i < n; i++)

Note that having the same variable in multiple address spaces does not mean that all the
copies always have the same values. It is legal to assign one copy a value, but not
others. For example, the following statements can be part of a legal program.

    <1>,<2> int c;
    <1>     c = 2;
    <2>     c = 4;

The compiler must follow 3 rules in assigning PIL's. If they are not obeyed, then the
behavior of the program is undefined. The rules are:

1. A variable cannot be referenced in address space p unless it has been declared in
address space p.

2. Statements with data dependent control flow may only have multiple PIL's if the
control flow is the same for all copies of the statement. For example, the result of the
test of the conditional must be the same using the copy of c in <1>, <2>, and <3> in
the statement below.

    <1>,<2>,<3> if (c == 2)

3. When one statement is lexically nested inside another, as is the case for the if and
for statements, the set of PIL's of every statement inside must be a subset of the
PIL's for the enclosing statement. For the program below the condition part of the
if must have <1> and <2> as PIL's because statements contained inside have those
PIL's.

    <1>,<2>,<3> if (a != 1)
    <1>             c = 1;
                else
    <2>             c = 2;

4.1.1 Rep and dist constructs

It is impossible to write programs where the size of the processor array is not known at
compile-time because PIL's must be compile time constants. The implications of
constant PIL's are discussed in CHAPTER 7. Two run-time variables and two constructs
must be added to the source language to make it possible to create programs that are
parameterized by the size of the processor array.




The variables are _np and _id. They are prefixed by an underscore to separate them
from user variables. The variable _np contains the size of the processor array and is
initialized by the run-time system. The variable _id contains the index of the current
processor and is also initialized by the run-time system. Both the size of the array and
processor indexes can be multidimensional numbers for n dimensional processor arrays. In
this case _np and _id are n element arrays.

The constructs are rep and dist. They are purely a shorthand for writing programs
that are compact and independent of the size of the processor array. The rep is used to
create a statement with multiple PIL's and the dist is used to create multiple copies of
a statement, each with a different PIL. Loop-like control is used to generate offsets
which are added to the PIL's of statements in the body to generate distinct PIL's for
every copy. The syntax of the two statements is:

    rep var = low to high with offset body

    dist var = low to high with offset body

Tokens in italics are syntactic variables. The var is a variable name. The low and high
are expressions that are functions of constants, variables defined by a dist or rep,
and _np. The offset is a PIL that is a function of constants, var, and _np. For a dist,
body is a statement or block of statements. For a rep, body is a single statement. The
scope of var is body. The var may be referenced in low and high expressions in rep and
dist statements in the body, but not in any other statement of the body.

The rep and dist are special statements which are not executable and do not have
their own PIL's. They do change the semantics of executable statements contained in the
body. A dist is used to create multiple copies of a block of statements which can then
be distributed across a set of processors. One copy of the body is created for each
possible value of the dist variable, which is bounded from below and above by low and
high with a step size of 1. The copies of the block are executed sequentially in the order
that they are generated. The offset is added to the PIL of every statement in the body.
The PIL's in the body can be any value. The offset can have a different value for every
copy of the body because it is a function of var. As an example, in the dist below,
three copies of the body are created. The values of the offset vectors are <0,0>, <1,0>,
and <2,0>.

    dist b = 0 to 2 with <b> {
    <0>     C = 1;
    <0>     D = 2;
    }

When we expand the above dist we obtain:

    {
    <0>     C = 1;
    <0>     D = 2;
    }
    {
    <1>     C = 1;
    <1>     D = 2;
    }
    {
    <2>     C = 1;
    <2>     D = 2;
    }
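The expansion of a dist can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the compiler's implementation: the name expand_dist and the representation of a statement as a (PIL, text) pair are hypothetical, and PIL's are kept one-dimensional for brevity.

```python
def expand_dist(low, high, offset, body):
    """Expand a dist: one copy of the body per value of the dist variable.

    body is a list of (pil, text) pairs with one-dimensional PILs;
    offset maps the dist variable to the integer added to each PIL.
    """
    copies = []
    for var in range(low, high + 1):       # low..high inclusive, step 1
        shift = offset(var)
        copies.append([(pil + shift, text) for pil, text in body])
    return copies                          # copies execute in this order


body = [(0, "C = 1;"), (0, "D = 2;")]
blocks = expand_dist(0, 2, lambda b: b, body)
for block in blocks:
    print("{")
    for pil, text in block:
        print(f"<{pil}> {text}")
    print("}")
```

Because the list of copies is built in increasing order of the dist variable, the order of the list is also the sequential execution order of the copies.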

A rep is used to generate multiple PIL's for a single statement. One PIL is generated
for every possible value of the rep variable by adding the current value of the offset to
the PIL for the statement that is the body of the rep. In the statement below, the rep
creates 3 PIL's for the statement.

    rep a = 0 to 2 with <a>
    <5> b = 1;

If we expand the above rep, we obtain:

    <5>,<6>,<7> b = 1;

A rep only generates the PIL's for a single statement even if that statement contains
other statements. For example, if the body of a rep is a for statement then multiple
PIL's are generated for the for statement but the body of the loop is not changed. This
makes it possible to replicate loop control across a set of processors but still distribute
the body of the loop. If we want to replicate the body of the for as well, then a rep
must be used for every statement of the body. In the example below we want to replicate
the for statement and its body on every processor of a 2 processor linear array. Each
statement must have a rep.

    rep a = 0 to 1 with <a>
    <0> for (i = 0; i < n; i++)
    {
        rep a = 0 to 1 with <a>
    <0>     b++;
        rep a = 0 to 1 with <a>
    <0>     C++;
    }

If we were to expand the rep's in the above program we would obtain:

    <0>,<1> for (i = 0; i < n; i++) {
    <0>,<1>     b++;
    <0>,<1>     C++;
    }
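The rep expansions above can be sketched in the same style. The names expand_rep and show are hypothetical helpers, and a statement is again assumed to be a (PIL, text) pair with a one-dimensional PIL.

```python
def expand_rep(low, high, offset, statement):
    """Expand a rep: one statement, one PIL per value of the rep variable."""
    pil, text = statement
    pils = [pil + offset(var) for var in range(low, high + 1)]
    return pils, text


def show(pils, text):
    """Format a statement with its list of PILs."""
    return ",".join(f"<{p}>" for p in pils) + " " + text


single = expand_rep(0, 2, lambda a: a, (5, "b = 1;"))
print(show(*single))          # <5>,<6>,<7> b = 1;

# Replicating a loop and its body: each statement needs its own rep.
program = ["for (i = 0; i < n; i++) {", "b++;", "C++;"]
replicated = [expand_rep(0, 1, lambda a: a, (0, text)) for text in program]
for pils, text in replicated:
    print(show(pils, text))
```

Unlike the dist sketch, a rep produces a single statement whose list of PIL's grows; no additional copies of the statement are created.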




To simplify code generation, we require that the program for each processor be
identical. This occurs under the following condition. If we expand all the rep and dist
constructs of a program, then for every statement s in the program and for every processor
index p in the processor array, there should be exactly one copy of statement s in the
expanded program that has p as one of its PIL's. We list two examples of valid programs
below. In the first program, there is exactly one copy of every statement for every
processor.

    dist a = 0 to _np[0]-1 with <a,0>
    dist b = 0 to _np[1]-1 with <0,b>
    <0,0> i = 1;

Expanding this program for a 2x2 processor array yields:

    <0,0> i = 1;
    <0,1> i = 1;
    <1,0> i = 1;
    <1,1> i = 1;

In the following program, there is less than one copy of each statement for each
processor, but each copy has multiple PIL's.

    dist a = 0 to _np[0]-1 with <a,0>
    rep b = 0 to _np[1]-1 with <0,b>
    <0,0> i = 1;

Expanding the rep and dist constructs for a 2x2 processor array yields:

    <0,0>,<0,1> i = 1;
    <1,0>,<1,1> i = 1;

The effect of this restriction is that every statement of the β program is executed in
every address space. Of course, in some situations we want only a single processor to
execute a statement. The restriction does not prevent it. In FIGURE 4-1 there is an
example of a β program that places a different statement on every processor, and below
it is the program rewritten so that the same code can be used on every processor.
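The condition that every statement reaches every processor exactly once can be checked mechanically on an expanded program. The following sketch assumes each expanded copy of a statement is given as a list of PIL tuples; the function name covers_every_processor is hypothetical.

```python
from itertools import product


def covers_every_processor(copies, shape):
    """True if each processor index is a PIL of exactly one copy.

    copies is a list of PIL lists, one list per expanded copy of a
    single statement; shape is the size of the processor array.
    """
    seen = [pil for pils in copies for pil in pils]
    every_index = list(product(*(range(n) for n in shape)))
    return sorted(seen) == every_index


# dist/dist example: four copies, one PIL each.
dist_dist = [[(0, 0)], [(0, 1)], [(1, 0)], [(1, 1)]]
# dist/rep example: two copies, two PILs each.
dist_rep = [[(0, 0), (0, 1)], [(1, 0), (1, 1)]]
# Invalid: processor (1, 1) never receives the statement.
missing = [[(0, 0), (0, 1)], [(1, 0)]]

print(covers_every_processor(dist_dist, (2, 2)))
print(covers_every_processor(dist_rep, (2, 2)))
print(covers_every_processor(missing, (2, 2)))
```

Comparing sorted lists rather than sets means that a processor index that appears twice is rejected as well as one that is missing.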

This restriction simplifies code generation because it allows us to emit a single program
for all processors; it does not place an undue burden on the compiler writer. All the
compilers that we are familiar with already generate a single program for all
processors [Ribas 90][Cft 90][Sussman 91][Chatterjee 91][Tseng 89]. We believe that most
compilers do not generate different code for each processor because it makes it difficult
to make a program that can be run on a varying number of processors. Furthermore, it is
much more expensive to compile and load many programs versus compiling and loading
a single program for all processors.

4.1.1.1 Communication

During the course of a computation, data must be moved from one processor to another.
This data movement must be expressed in the β program as a movement between
address spaces. For this reason, we include send and receive primitives in the
language of β programs.

FIGURE 4-1  β program that maps different statements to each processor and an
equivalent β program that maps the same statements to each processor

    <0> A = 1;
    <1> A = 2;

    dist b = 0 to _np[0]-1 with <b>
    <0> if (_id[0] == 0)
    <0>     A = 1;
    dist b = 0 to _np[0]-1 with <b>
    <0> if (_id[1] == 1)
    <0>     A = 2;

In the β program, communication must be inserted into the program so that a send is
always executed before its corresponding receive. If the send is not executed before
the receive, then the program may deadlock during debugging. Furthermore, the
PIL's of the sending and receiving processors must match the connectivity of the
processor array. In our mesh connected 2D array, the value sent in the statement below:

    <3,5> send((0,-1),7)

can only be received by a statement with a PIL of <3,4> that does a receive with
direction (0,1).
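The matching rule in this example can be written down as two small functions. The sketch generalizes from the single example above and rests on two assumptions that the text does not state in general form: that the direction vector is added to the PIL of the sender to find the receiver, and that the receiver names the negated direction. The function names are hypothetical.

```python
def receiver_of(sender_pil, direction):
    """PIL of the neighbor reached by sending in the given direction."""
    return tuple(p + d for p, d in zip(sender_pil, direction))


def matching_receive_direction(direction):
    """Direction the receiver must name to accept the value."""
    return tuple(-d for d in direction)


sender = (3, 5)
direction = (0, -1)
print(receiver_of(sender, direction))           # (3, 4)
print(matching_receive_direction(direction))    # (0, 1)
```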

To simplify our examples, we sometimes group several communication actions together
into a single primitive. For example, instead of constructing a broadcast operation out of
sends and receives, we use a broadcast primitive.

4.1.2 Code generation

When we apply thread splitting to the β program, we extract a program for every
processor. The program for processor p consists of all the statements in the β program
with a PIL of p.

We require that there is exactly one copy of every statement in the β program for each
processor so code generation is straightforward; the program for a processor is the same
as the β program with the PIL's, rep, and dist constructs removed.




As an example, if we apply thread splitting to the second program in FIGURE 4-1, we
would obtain the following ω program:

    if (_id[0] == 0)
        A = 1;
    if (_id[1] == 1)
        A = 2;
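Extracting the program for each processor amounts to a projection of the expanded β program on its PIL's. The sketch below assumes the expanded program is a list of (PIL list, text) pairs in program order; split_threads is a hypothetical name, and the statement texts are treated as opaque strings.

```python
def split_threads(beta_program, processors):
    """Extract the statements for each processor from an expanded program.

    beta_program is a list of (pils, text) pairs in program order.
    """
    return {
        p: [text for pils, text in beta_program if p in pils]
        for p in processors
    }


# Expanded form of the second program in FIGURE 4-1 on a 2 processor array.
beta = [
    ([0], "if (_id[0] == 0)"), ([0], "A = 1;"),
    ([1], "if (_id[0] == 0)"), ([1], "A = 1;"),
    ([0], "if (_id[1] == 1)"), ([0], "A = 2;"),
    ([1], "if (_id[1] == 1)"), ([1], "A = 2;"),
]
omega = split_threads(beta, [0, 1])
for line in omega[0]:
    print(line)
print(omega[0] == omega[1])
```

When the restriction on β programs holds, every processor receives the same list of statements, which is why a single program can be emitted for all processors.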

The β program behaves as if there is a single thread of control executing in multiple,
disjoint address spaces. After thread splitting, the resulting ω program is a set of
independent threads each executing in its own address space. In the second program in
FIGURE 4-1, the semantics of the language for β programs imply that both copies of the
statement execute sequentially. In the ω program there is no synchronization to
enforce this constraint; processor 0 could execute the statement before, at the same
time, or after processor 1 executes its copy of the same statement.

4.1.3 Example: loop pipelining

In this section, we present an example that demonstrates the use of PIL's, rep, dist,
and communication.

We start with the α program found in part (a) of FIGURE 4-2. We use loop pipelining to
distribute the work to processors. In loop pipelining, the work of a body of a single
iteration is spread across several processors [Sussman 91]. Intermediate values in the
loop iteration are passed from processor to processor; part (b) of FIGURE 4-2 illustrates
the flow of data between processors.

Part (c) of FIGURE 4-2 shows the β program; the program to the right is the same
program where the rep and dist have been expanded. Each statement has been annotated
with the appropriate PIL. The variable declaration has been placed inside a rep so that
each processor can have its own copy. The for loop is mapped to all processors, so it is
placed inside a rep. Each processor must follow a different execution path through the
body of the loop, so the body is placed inside a dist. Since the value of A is produced
in one processor and consumed in another, its value is transferred between processors
using send and receive.

This program illustrates the rules for using PIL's and communication. For
communication, the sends must appear before receives. Processor 0 sends a value and
processor 1 receives it, so the dist is written so that the copy of the code for processor 0
is before processor 1. If the program were changed so that processor 1 sent the value to
processor 0, then the dist would have to be changed so that the copy for processor 1
appears before the copy for processor 0. This can be done by changing the <b> in the
dist to <1-b>.

Rule 1 for PIL's requires that a variable must be declared in every address space in
which it is used. This rule is satisfied by our example and any other valid program
because every statement must have a copy on every processor; thus every variable is by
default declared in every address space.


FIGURE 4-2  Thread splitting example for loop pipelining

(a) α program

    int i,A,b[30],c[30],g[30];
    for (i = 0; i < 30; i++) {
        A = b[i] * c[i];
        g[i] = A * 2;
    }

(b) mapping of operations to processors

    [diagram not recoverable]

(c) β program and its expanded version

β program:

    rep a = 0 to 1 with <a>
    <0> int i,A,b[30],c[30];
    rep a = 0 to 1 with <a>
    <0> for (i = 0; i < 30; i++) {
        dist b = 0 to 1 with <b>
    <0>     if (_id[0] == 0) {
    <0>         A = b[i] * c[i];
    <0>         send((1),A);
            } else {
    <0>         receive((-1),A);
    <0>         g[i] = A * 2;
            }
    }

Expanded version:

    <0>,<1> int i,A,b[30],c[30];
    <0>,<1> for (i = 0; i < 30; i++) {
    <0>         if (_id[0] == 0) {
    <0>             A = b[i] * c[i];
    <0>             send((1),A);
                } else {
    <0>             receive((-1),A);
    <0>             g[i] = A * 2;
                }
    <1>         if (_id[0] == 0) {
    <1>             A = b[i] * c[i];
    <1>             send((1),A);
                } else {
    <1>             receive((-1),A);
    <1>             g[i] = A * 2;
                }
            }

(d) ω program

    int i,A,b[30],c[30];
    for (i = 0; i < 30; i++)
        if (_id[0] == 0) {
            A = b[i] * c[i];
            send((1),A);
        } else {
            receive((-1),A);
            g[i] = A * 2;
        }



Rule 2 for PIL's requires that a statement with data dependent control flow and multiple
PIL's behave the same for every PIL. The for statement obeys this rule because we
know that the copy of the variable i in <0> is always the same as the copy of the
variable in <1>, hence the branching of the for is always the same in all address spaces.

Rule 3 for PIL's specifies that if one statement is nested inside another the PIL's for the
inner statement must be a subset of the PIL's for the outer statement. The body of the
for loop contains statements with a PIL of <0> and <1>, so the for loop header must
have PIL's that include <0> and <1>.
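Rule 3 is a simple subset test, which the following sketch makes explicit. It assumes the PIL's of a statement are given as a list of one-dimensional indices; obeys_rule_3 is a hypothetical name and the sketch checks only one level of nesting.

```python
def obeys_rule_3(outer_pils, inner_statements):
    """Check that every nested statement's PILs are a subset of the outer PILs."""
    outer = set(outer_pils)
    return all(set(pils) <= outer for pils in inner_statements)


# Expanded loop of FIGURE 4-2: header on <0>,<1>, body statements on <0> or <1>.
print(obeys_rule_3([0, 1], [[0], [0], [1], [1]]))
# A header that only names <0> cannot enclose a statement on <1>.
print(obeys_rule_3([0], [[0], [1]]))
```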

The result of thread splitting is shown in part (d) of FIGURE 4-2.

4.2 The D function for thread splitting

The β program, which is the source program for the thread splitting transformation,
defines a total order of operations in the ω program. Thread splitting removes all the
order constraints between processors that aren't the result of communication. If the ω
program is interrupted during execution, either by a user interrupt, program exception,
or breakpoint, then the state of the program is not necessarily consistent (see Definition
2-1) with any state of the β program because there may be early or late operations. The
debugger for thread splitting must do dynamic order restoration to provide correct
behavior (see Definition 2-3). One way to do dynamic order restoration is to force the
execution of operations in the ω program to be in the same order as they are executed in
the β program. Since the β program has sequential semantics, we must execute the ω
program one operation at a time. We call this executing with a sequential schedule.

An alternative to executing with a sequential schedule is to allow the program to
execute in parallel and detect when the program is not in a state that is consistent with a
state of the β program. The debugger can then prevent the user from doing anything
that would lead to incorrect behavior, such as examining a variable when it has been
updated by an early operation.

Either method requires a way for the debugger to determine at run-time what order the β
program would have executed the operations of the ω program. Executing with a
sequential schedule follows this ordering when determining what statement on what
processor executes next. If we allow parallel execution, then the ordering is used to
determine if there are any late or early operations.

The ordering is determined by assigning virtual times to the execution of statements. By
comparing the virtual times of statements, we can decide which one should be executed
first. In the rest of this chapter, we will also use the term the virtual time of a processor.
The virtual time of a processor is the virtual time of the statement that the processor will
execute next.

In Section 4.2.1, we define the properties that a virtual time must satisfy and describe
how to compute the virtual time of a statement. We will then show how the virtual time
can be used to construct a D function which executes the program with a sequential
schedule in Section 4.2.2. In Section 4.3 we show how to allow the program to execute
in parallel while debugging.




4.2.1 Computing virtual times

Each execution of a statement is called an instance of a statement. The execution order
of instances of the β program defines a total ordering. There is a 1 to 1 mapping between
instances in the β program and instances in the ω program.

We want to assign virtual times to the execution of statements in the ω program so that
we can determine what order they would have been executed in the β program. Each
instance of a statement in the β program is assigned a virtual time greater than the
virtual time of the previous instance. More formally:

Equation 4.1   $s_1 \text{ occurs before } s_2 \text{ in the } \beta \text{ program} \iff \mathit{virtual\_time}(p_1) < \mathit{virtual\_time}(p_2)$

$s_1$ and $s_2$ are statement instances in the β program. $p_1$ and $p_2$ are their
corresponding statement instances in the ω program, respectively. If two statements on
different processors have the same virtual time, then both of those statements are
derived from the same statement in the β program.

In the following section we present a method for computing virtual times; this method
satisfies Equation 4.1.

4.2.1.1 Computing virtual times for the β program

First, we describe a method for assigning virtual times to statement instances in the β
program so that the virtual times of instances are strictly increasing during execution of
the program. Next, we describe how we can assign the same virtual times to the
corresponding statement instances in the ω program.

The virtual time of a processor is a tuple of 2n + 1 numbers where n is the maximum
level of nesting of loops in the program. It is represented below as a vector called vtime.
A program without any loops has 0 levels of nesting and hence has one counter. A
virtual time t_i is greater than a virtual time t_j if t_i is lexicographically greater than t_j.
Virtual time is advanced by executing statements of a program. Virtual time can be
computed by following these rules:

1. Initially, nesting = 0

2. After entering a loop, nesting++

3. After exiting a loop, vtime[2*nesting] = 0, vtime[2*nesting+1] = 0, nesting--

4. After branching to the top of a loop, vtime[2*nesting]++

5. For each statement, vtime[2*nesting+1] = line number of the statement about to be
executed

For a dist statement in a β program, each copy of the body must have a virtual time
greater than the previous copy. A dist is treated like a loop where each copy of the
body is a loop iteration. When entering a dist, vtime[2*nesting+1] is the line number of
the dist and nesting is incremented by 1. For every copy of the dist, vtime[2*nesting]
(using the incremented nesting) is the value of the dist variable, which is strictly
increasing. Since all copies of a statement in a rep correspond to the same statement, a
rep does not need special treatment; every copy created by the rep has the same virtual
time. For the program in FIGURE 4-3, the sequence of virtual times for an execution
where k > 1 is:

    (2,0,0,0,0), (3,0,4,0,4), (3,0,4,0,5), (3,0,4,1,4), (3,0,4,1,5), (3,1,4,0,4),
    (3,1,4,0,5), (3,1,4,1,4), (3,1,4,1,5)
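The five rules can be exercised on the β program of FIGURE 4-3 to reproduce this sequence. The sketch below is an illustration under assumptions: the class VirtualClock and its method names are hypothetical, the dist and the for loop are each assumed to run for two iterations, and because Python lists are indexed from 0 the position written vtime[k] in the rules is stored at index k - 1.

```python
class VirtualClock:
    """Tracks a virtual time of 2n + 1 numbers for n levels of nesting."""

    def __init__(self, max_nesting):
        self.vtime = [0] * (2 * max_nesting + 1)
        self.nesting = 0                        # rule 1

    def at_line(self, line):                    # rule 5
        self.vtime[2 * self.nesting] = line
        return tuple(self.vtime)

    def enter_loop(self):                       # rule 2
        self.nesting += 1

    def next_iteration(self):                   # rule 4
        self.vtime[2 * self.nesting - 1] += 1

    def exit_loop(self):                        # rule 3
        self.vtime[2 * self.nesting - 1] = 0
        self.vtime[2 * self.nesting] = 0
        self.nesting -= 1


clock = VirtualClock(max_nesting=2)
seq = [clock.at_line(2)]                # line 2: k = 1
clock.at_line(3)                        # line 3: the dist, treated as a loop
clock.enter_loop()
for b in range(2):                      # one iteration per copy of the dist
    if b:
        clock.next_iteration()
    clock.at_line(4)                    # line 4: the for statement
    clock.enter_loop()
    for i in range(2):                  # normalized loop counter
        if i:
            clock.next_iteration()
        seq.append(clock.at_line(4))    # loop header
        seq.append(clock.at_line(5))    # loop body on line 5
    clock.exit_loop()
clock.exit_loop()

print(seq)
# Tuples compare lexicographically, so the sequence is strictly increasing.
print(all(a < b for a, b in zip(seq, seq[1:])))
```

The sketch records a virtual time only for the statements that appear in the sequence above; the final loop exit test is not recorded.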


4.2.1.2 Computing the virtual time of a statement during execution of the ω program

When a processor stops executing, the debugger must compute the virtual time of the
current statement. We want to precompute as much of the virtual time for a statement as
possible. In the ω program of FIGURE 4-3, the left hand column is the virtual time
template. A variable in the virtual time template indicates that the value should be
derived from a dist variable or a loop counter. The value of the 1st, 3rd, 5th, ..., 2n+1
positions of the virtual time can be determined statically from the program counter and
are part of the virtual time template. The value of counters for a dist can be determined
from the processor index, which is also static information, and the value of counters for
loops can only be determined at run-time.


FIGURE 4-3  Sequential and parallel program

α program:

    l1: k = 1;
    l2: for (i = 0; i < 4; i++)
    l3:     A[i] = i;

β program:

    1     rep a = 0 to _np[0] with <a>
    2 <0> l1: k = 1;
    3     dist b = 0 to _np[0] with <b>
    4 <0> l2: for (i = _id[0]*2; i < (_id[0]+1)*2; i++)
    5 <0> l3:     A[i-_id[0]*2] = i;

ω program:

    (2,0,0,0,0)  l1: k = 1;
    (3,b,4,i,4)  l2: for (i = _id[0]*2; i < (_id[0]+1)*2; i++)
    (3,b,4,i,5)  l3:     A[i-_id[0]*2] = i;


The value of the counters that are associated with loops are the number of times the
loops have executed. This can usually be determined from the value of the loop counters
that already exist in the program. In the example, i can be used for the counter in the
fourth position. Loop counts for virtual time always start at 0 and are incremented by 1.
The counters for loops written by the user may not necessarily do this. In the thesis, we
assume that all loop counters are normalized so that they always begin at 0 and have
increments of 1.

The counter value for a dist construct is more difficult to determine. The dist
variable is not a real variable that exists in the program. The processor index label of a
statement inside a rep or dist is a function that maps the value of the rep and dist
variables to a processor index. We want the inverse, a function that maps the processor
index to the value of the dist variables. Since we require that every statement be
mapped to every processor exactly once, we know that the processor index label function
is a 1 to 1 and onto mapping of rep and dist variables and must have an inverse.
We use a brute force approach to find the inverse. We compute all the possible
combinations of values of rep and dist variables for a statement, and compute the
associated processor index. If we put the results in a table indexed by the processor
index, we can compute the rep and dist variables from the processor index. The table has
a fixed size because there can only be one entry per processor. For the example in
FIGURE 4-3, the dist variable b always has the value 0 on processor 0, and 1 on
processor 1.
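The brute force inversion described above can be sketched as a table built by enumerating every combination of rep and dist values. The function name inverse_table and the way the processor index label function is passed in as a Python callable are assumptions made for the illustration.

```python
from itertools import product


def inverse_table(ranges, label):
    """Map each processor index back to the rep/dist variable values.

    ranges gives (low, high) for each rep or dist variable; label is the
    processor index label function applied to one combination of values.
    """
    table = {}
    for values in product(*(range(lo, hi + 1) for lo, hi in ranges)):
        index = label(*values)
        assert index not in table, "label function is not 1 to 1"
        table[index] = values
    return table


# FIGURE 4-3: a single dist variable b with offset <b> added to PIL <0>.
table = inverse_table([(0, 1)], lambda b: (0 + b,))
print(table[(1,)])       # value of b on processor 1

# Two variables on a 2x2 array with offsets <a,0> and <0,b>.
grid = inverse_table([(0, 1), (0, 1)], lambda a, b: (a, b))
print(grid[(1, 0)])
```

The table has one entry per processor, so its size is bounded by the size of the processor array.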

In the example of FIGURE 4-3, the virtual time of a processor executing the first
statement is (2,0,0,0,0). If processor 0 is stopped at the second statement and the value
of the loop counter is 0, then the virtual time is (3,0,4,0,4). If processor 1 is stopped at
the third and the value of the loop counter is 3, then the virtual time is (3,1,4,3,5).
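Filling in a virtual time template when a processor stops can be sketched as a lookup. The representation is an assumption: integers in the template are the static positions, strings name a dist variable or a loop counter, and the function name instantiate is hypothetical.

```python
def instantiate(template, dist_values, loop_counters):
    """Fill a virtual time template with dist values and loop counters.

    Integer entries are static; string entries name a dist variable
    (looked up from the processor index) or a run-time loop counter.
    """
    env = {**dist_values, **loop_counters}
    return tuple(env[e] if isinstance(e, str) else e for e in template)


# Third statement of FIGURE 4-3 on processor 1 with loop counter 3.
print(instantiate((3, "b", 4, "i", 5), {"b": 1}, {"i": 3}))   # (3, 1, 4, 3, 5)
# Second statement on processor 0 with loop counter 0.
print(instantiate((3, "b", 4, "i", 4), {"b": 0}, {"i": 0}))   # (3, 0, 4, 0, 4)
```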

The discussion above assumes that the entire program is a single procedure. Our virtual
time method can be extended to programs with procedure calls and recursion in a
straightforward manner. Within a procedure, virtual time is computed the same way. We
shall call this V_proc, for procedure virtual time. As part of the normal calling sequence,
every time a procedure is called the program saves the return address on the program
stack. For each return address on the stack, we compute a V_proc. The virtual time of the
program is computed by building a vector of V_proc's, where the first one is the
one deepest on the stack (earliest in time), the second element is the second deepest on
the stack, and so on. The virtual time is now a vector of vectors. Recursion does not
change the way the virtual time is computed. Virtual times can be ordered using a
lexicographic comparison.
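Ordering a vector of vectors lexicographically can be illustrated with nested tuples, which Python compares element by element. The line numbers and the function name program_virtual_time below are hypothetical, and each procedure is assumed to have no loops so that its procedure virtual time has a single position.

```python
def program_virtual_time(frames):
    """Virtual time of a program as a vector of per-procedure virtual times.

    frames lists the per-procedure virtual times from the deepest stack
    frame (earliest call) to the current procedure.
    """
    return tuple(tuple(f) for f in frames)


# Both processors are inside a call made from line 7 of the caller;
# processor 0 is at line 2 of the callee and processor 1 at line 5.
t0 = program_virtual_time([(7,), (2,)])
t1 = program_virtual_time([(7,), (5,)])
# A processor still in the caller at line 6 is earlier than both.
t2 = program_virtual_time([(6,)])
print(t0 < t1)
print(min([t0, t1, t2]) == t2)
```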

As mentioned in the previous chapter, program flow graphs must be reducible. This is
because the virtual time scheme we presented assumes the program is structured; there
must be an identifiable loop structure.

4.2.2 D function for thread splitting with a sequential schedule

If we execute the parallel program according to a sequential schedule, then no additional
order restoration need be done; the debugger must only do structural mapping. We can
execute with a sequential schedule by executing the statements in virtual time order.

Structural mapping for thread splitting is straightforward because of the simple mapping
between statements and variables in the β and ω programs. The restriction that all
programs of the ω program be identical simplifies the problem further. There is one copy
of every variable in every name space and one copy of every statement for every
processor. If the user sets a breakpoint on a statement, then we want to set a breakpoint
on every copy of that statement in each address space. If the where command is used to
inspect the current location, we use the location of the processor with the lowest virtual
time because this is the statement that is executed next. If the user wants to inspect or
modify a variable in the β program, they must specify which copy they want. A copy of a
variable can be referenced with its name in the address space and the processor index.
For example, if the user wants to inspect the copy of the variable C on processor <1,2>,
they would use the name (C,<1,2>).




The D function that executes according to a sequential schedule is listed below. The
parallel program state is a sequence of states, one for each process. The argument dstack
is a list of debugger stacks, one for each processor. The argument program is a list of
programs, one for each processor. The ad_space function determines the index of the
processor of which a particular variable is a member. The function min_virt_time returns
the index of the processor with the minimum virtual time. The select function picks one
element out of a list, based on the index supplied. It is used to extract the appropriate
debugger stack from the dstack parameter.

For every command but run, the debugger decides which processor this command acts
on and passes on the command to that processor. If examine or set are used, then the
processor selected is the one on which the variable resides. For the where and run,
the processor selected is the one with the lowest virtual time.

For the run command, we repeatedly single step the program until we reach a
breakpoint or the program terminates. After every step, we must determine which
processor should be the next one to single step.

debugger(dstack,command,state,program,bpts,name,value)
{
    /* decide which processor this command acts on */
    if (command == where || command == step)
        index = min_virt_time(state);
    else index = ad_space(name);

    /* select the debugger stack, processor state
       and program for the processor */
    stack1 = select(dstack,index);
    state1 = select(state,index);
    program1 = select(program,index);
    deb1 = first(stack1);

    /* pass the command to the lower level debugger */
    if (command == run) {
        /* single step until we hit a breakpoint
           or the program terminates */
        location = apply(deb1,rest(stack1),
                         where,state1,program1,bpts,name,value)
        while (location not a member of bpts and
               no exceptions yet) {
            state1 =
                apply(deb1,rest(stack1),
                      run,state1,program1,all,name,value);
            merge state1 into state

            /* decide which processor to act on */
            index = min_virt_time(state);

            /* select the new processor state, debugger stack
               and program */
            stack1 = select(dstack,index);
            state1 = select(state,index);
            program1 = select(program,index);
            deb1 = first(stack1);

            location =
                apply(deb1,rest(stack1),
                      where,state1,program1,bpts,name,value)
        }
        merge state1 into state
        return state
    }
    if (command == where) {
        return apply(deb1,rest(stack1),
                     where,state1,program1,bpts,name,value)
    }
    if (command == examine)
        return apply(deb1,rest(stack1),
                     examine,state1,program1,bpts,name,value);
    if (command == set) {
        state1 = apply(deb1,rest(stack1),
                       set,state1,program1,bpts,name,value);
        merge state1 into state
        return state
    }
}
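
The run branch of the listing above can be illustrated with a small Python sketch. This is
only a model under stated assumptions, not the thesis implementation: each processor is
represented by a hypothetical pre-recorded list of (virtual time, location) pairs, virtual
times are tuples compared lexicographically, and the names run and min_virt_time below are
simplified stand-ins for the functions of the same name in the listing.

```python
def min_virt_time(pcs, traces):
    """Index of the unfinished processor whose next statement has the lowest virtual time."""
    live = [p for p in range(len(traces)) if pcs[p] < len(traces[p])]
    if not live:
        return None
    return min(live, key=lambda p: traces[p][pcs[p]][0])


def run(traces, bpts):
    """Single step the lowest-virtual-time processor until a breakpoint or termination.

    traces[p] is an assumed list of (virtual_time, location) pairs for processor p,
    in that processor's own execution order. Returns the list of executed steps and
    the location where execution stopped (None if the program terminated).
    """
    pcs = [0] * len(traces)          # next statement index on each processor
    executed = []
    while True:
        p = min_virt_time(pcs, traces)
        if p is None:                # every processor has finished
            return executed, None
        vt, loc = traces[p][pcs[p]]
        if loc in bpts:              # stop before executing the breakpoint statement
            return executed, loc
        executed.append((p, vt, loc))
        pcs[p] += 1
```

Because the processor with the lowest virtual time is always stepped next, the statements
are executed in the same order as in the sequential program, which is why this schedule
permits full debuggability.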


4.3  Parallel  execution  while  debugging 


While executing the parallel program with a sequential schedule permits full debuggability,
it has the drawback that execution is very slow. Another option for dynamic order
restoration is to let the program run in parallel, and detect when the program is not in a
state that is consistent with a state of the β program.

Parallel execution creates two problems. The first is that events such as breakpoints and
exceptions may occur in the parallel program in a different order than they would have
occurred in the sequential program. Virtual times are used to determine the correct order
for presenting events to the user. The second problem is that when the parallel program
stops, it might not be in a state that is consistent with any state of the β program. If we
allow the user to examine or modify variables, then the behavior might not be correct
(as defined by Definition 2-3). The debugger must detect when the state of the program
is not consistent with a state of the β program and prevent the user from doing anything
that will cause incorrect behavior. A debugger can also provide mechanisms that make it
possible for the user to put the program into a state that is consistent with a state of the β
program.

We begin with Section 4.3.1, which defines some terms. The basic functioning of the
debugger is described in Section 4.3.2, which introduces the subcomponents.




4.3.1 Terms

In this section we define some terms that are used in the discussion of parallel execution.
The concepts of virtual time and committing events are taken from Jefferson's
Time Warp Operating System [Jefferson 83] [Jefferson 87]. An event is an occurrence
caused by the execution of the program that requires some debugger action, like breakpoints
or exceptions. The virtual time of an event is the virtual time of the statement that
caused the event. A breakpoint or exception is pending if it has occurred but hasn't been
presented to the user yet. Committing an event is the act of presenting a pending event to
the user. In some situations, the debugger may not be able to let the user examine a variable,
modify a variable, or set a breakpoint and still maintain correct behavior. In this
case, the debugger tells the user that it cannot complete the action. We call this disallowing
a command. In some cases, we may want to allow a processor or set of processors to
execute until they reach a specific virtual time. We call this roll forward.

4.3.2 Debugger

The debugger functions as follows. The user starts the program, and it executes in parallel.
Next, an event such as a breakpoint or exception occurs on one or more processors
or the user aborts execution of the program. The debugger stops all processors and
determines their virtual times. All events are marked as pending. The debugger then
picks a virtual time in the execution of the β program to represent to the user as the current
time; we call this the β time. For correct behavior, it must appear to the user that all
operations with a lower virtual time have been executed and that no operations with a
higher virtual time have been executed yet. Picking the β time is explained in Section
4.3.3. After selecting the β time, the debugger can present pending events to the user,
and wait for the user to execute some commands. If the state of the program is not consistent
with a state of the β program, then the debugger must disallow some commands.
The method for deciding if commands should be disallowed can be found in Section
4.3.4. In Section 4.3.5, we conclude with a description of what the debugger can do to
help the user if a command is disallowed.

4.3.3 Choosing the β time

In some cases, we have some flexibility about what time the debugger picks as the β
time. This decision determines which operations are early and which operations are late,
which in turn determines which variables may be inspected or modified and where
breakpoints can be set. Possible choices range from the earliest virtual time of all the
processors to the latest virtual time. If we pick the earliest virtual time, then we can be
sure that there are no late operations. The debugger should never pick a time earlier than
the earliest processor because it would only increase the number of early operations
without decreasing the number of late operations. If we pick the latest virtual time, then
we can be sure that there are no early operations. The debugger should never pick a time
later than the latest processor because it would increase the number of late operations
without decreasing the number of early operations.
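
The trade-off between the two extreme choices can be illustrated with a short Python
sketch. It rests on assumptions that are not part of the original text: virtual times are
modeled as tuples compared lexicographically, and the function name classify is
hypothetical. A processor whose current time lies beyond the β time may have executed early
operations; one whose current time lies before it still has late operations to execute.

```python
def classify(beta_time, proc_times):
    """Split processor indices by their relation to the chosen beta time.

    proc_times[p] is the current virtual time of processor p (a tuple).
    Returns (may_have_early, has_late): processors that ran past beta_time,
    and processors that have not reached it yet.
    """
    may_have_early = [p for p, t in enumerate(proc_times) if t > beta_time]
    has_late = [p for p, t in enumerate(proc_times) if t < beta_time]
    return may_have_early, has_late


# Two processors stopped at different virtual times (values chosen for illustration).
proc_times = [(2, 3, 6, 2, 6), (2, 6, 3, 1, 4)]
earliest_choice = classify(min(proc_times), proc_times)  # no late operations
latest_choice = classify(max(proc_times), proc_times)    # no early operations
```

Picking the earliest time empties the second list, and picking the latest time empties the
first, which matches the two guarantees stated above.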

Interrupts by the user are not tied to a particular statement of the program, hence there is
some flexibility for choosing the β time when this occurs. If our only concern is to minimize
the number of early and late operations, then the best time is the latest virtual time
of all the processors. With this choice, there aren't any early operations, and we can roll
forward all the processors with a virtual time lower than the β time until there are no
more late operations.

Eliminating all the late and early operations is beneficial because the current state will
be consistent with a state of the β program. In such a state, the debugger will not disallow
any commands. However, rolling forward to the latest virtual time may prevent the
debugger from having a timely response to interrupts. For example, if a loop has 1,000
iterations, and each processor is assigned a block of 100 contiguous iterations, then one
processor will immediately execute iteration 900 of the loop. If we chose the highest
virtual time as the β time, then for an interrupt the debugger would always roll forward
to at least iteration 900 in the program, no matter when the interrupt occurs and how long it
takes for the program to get to iteration 900.

For interrupts, we believe that a timely response is more important than being able to
stop in a state where there are no early operations. Since it is simple to roll processors
forward, the β time should be the virtual time of the earliest processor. The user can later
decide to roll processors forward to eliminate early and late operations, but they still
have the opportunity to inspect earlier state.

Unlike interrupts, there is no choice in the selection of the β time when breakpoints or
exceptions are pending. If the debugger is to present events in source order, the β time
must be the time of the earliest pending event. Before committing the event, the debugger
must roll forward all processors with late operations to make sure that there are no
late events. If there are any late events, then they would have a virtual time less than the
β time of the breakpoint or exception so they must be presented first. If new events
occur during roll forward, those events become pending, and the new β time is the time
of the earliest pending event.

To summarize, user interrupts are handled differently from events such as breakpoints
and program exceptions. If the program is running and an event occurs, we stop all processors
and determine their virtual time. The β time is chosen to be the time of the earliest
pending event. We then roll forward all processors to that β time. Another event may
occur during roll forward; if it does, the debugger handles it the same as if the program
were running and an event occurred. Once roll forward is complete, we commit the
event by letting the user know that a breakpoint or an exception has occurred. If the user
continues execution and there are still events pending, then the debugger acts as if the
program were running and another event just occurred.

If the user interrupts execution of the program and there are no events pending, we stop
all processors and determine their virtual time. The β time is the lowest virtual time of
all the processors and no roll forward is necessary because there are no late operations.
For either events or interrupts, we never stop the program with any late operations
because the β time is always the lowest virtual time of all the processors.
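
The selection rule summarized above can be sketched as a small Python function. This is a
simplified illustration under assumptions of our own: virtual times are tuples compared
lexicographically, the name choose_beta_time is hypothetical, and the sketch does not model
new events that become pending during roll forward.

```python
def choose_beta_time(proc_times, pending_event_times):
    """Pick the beta time and the processors that must be rolled forward.

    proc_times[p] is the current virtual time of processor p.
    pending_event_times holds the virtual times of pending breakpoints/exceptions;
    it is empty when the program was stopped by a user interrupt.
    """
    if pending_event_times:
        # Events must be presented in source order: use the earliest pending event.
        beta = min(pending_event_times)
    else:
        # User interrupt: favor a timely response, use the earliest processor.
        beta = min(proc_times)
    # Processors behind the beta time still have late operations to execute.
    roll_forward = [p for p, t in enumerate(proc_times) if t < beta]
    return beta, roll_forward
```

With an interrupt the roll-forward list is always empty, since no processor can be behind
the minimum of all processor times.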

4.3.4 Disallowing user commands

After the program has stopped, it may not be in a state that is consistent with execution
of the β program. If it is not in a consistent state, then using the debugger to examine or
modify the program state might lead to incorrect behavior of the debugger. To avoid
incorrect behavior, the debugger must decide which commands to disallow. A command
must be disallowed if its behavior is changed by early operations. For example, an
early write of a variable would cause the debugger to display the wrong value if the user
examined that variable.

Computing the set of early operations is the first step in deciding if a command should
be disallowed. We explain this in Section 4.3.4.1. Next, the debugger must determine
the effect that the early operations have on the particular command. In Section 4.3.4.2
and Section 4.3.4.3, we explain what type of early operations will cause incorrect
behavior for each of the debugger commands.

4.3.4.1 Computing the set of early operations

The set of early operations consists of the instances of statements that are executed that have a
virtual time greater than or equal to the β time. There can be no general and practical
method to compute the exact set of early operations because the execution of some
operations can be data dependent. For example, if there were a conditional inside a loop,
and the loop executes some early iterations, we would have to know which way the conditional
branched in the past to know if the operations contained in the then clause are
early operations. This would require the debugger to record a complete trace of execution
of operations until it is certain that the operations are not early.

We can use a conservative approximation for the set of early operations. In this context,
a conservative approximation includes all operations that are executed early and may
include some operations which were not executed early. This may lead us to incorrectly
disallow a command, but we will never permit a command that should have been disallowed.

We can compute the set of early operations for one processor by computing for each
statement the set of possible virtual times over the entire execution of the program. We
then eliminate all the virtual times that are less than the β time or greater than or equal to
the current time of the processor.

Computing the set of virtual times for a statement is a straightforward extension of the
method we use to compute the current virtual time of a processor, as is described in Section
4.2.1.2. For each statement, we start with the virtual time template. The values for
dist variables can be determined from the processor index. To compute the set of virtual
times possible, we initially assume that the values for all loop counters in a virtual
time template have a range from 0 to ∞. Next we remove virtual times from the set that
are less than the β time or greater than the current time.

For the example ω program in FIGURE 4-4, assume that processor 0 stopped at virtual
time (2,3,6,2,6) and processor 1 stopped at virtual time (2,6,3,1,4). The β time would be
(2,3,6,2,6). For processor 0, the β time is the same as the current time of the processor,
so we know that there are no early operations. For processor 1, there are early operations.
The first line of the ω program has the virtual time set of (2,n,2,0,0) where
0 ≤ n ≤ ∞. The first virtual time of the set that is greater than or equal to the β time is
(2,4,2,0,0), so we can bound n from below by 4. The highest virtual time that is less
than the current time is (2,6,2,0,0), so we can bound n from above by 6. This leaves
(2,n,2,0,0), 4 ≤ n ≤ 6 as the set of early operations for the first statement.




The second line of the ω program is contained inside a dist. Its virtual time template
contains the dist variable a, which has a value of 0 on processor 0 and 1 on processor
1. Because the virtual time template of the statement is different on each processor, the
virtual time set of the statement on each processor is different. On processor 1, the set is
(2,n,3,1,4), where 0 ≤ n ≤ ∞. The lowest virtual time greater than or equal to the β time
is (2,4,3,1,4). The highest virtual time that is less than the current time is (2,5,3,1,4).
This leaves (2,n,3,1,4), 4 ≤ n ≤ 5 as the set of early operations for the second statement.

For the third statement of the ω program, there are two loop counters. The set is
(2,n,6,m,6) where 0 ≤ n ≤ ∞ and 0 ≤ m ≤ ∞. For the low time, 3 and 2 are chosen as
values of the j and k loop counters because that is their value in the β time. The low
time is (2,3,6,2,6) and the high time is (2,5,6,∞,6). The set of early operations is
(2,3,6,m,6), 2 ≤ m ≤ ∞ and (2,n,6,m,6), 4 ≤ n ≤ 5 and 0 ≤ m ≤ ∞. The set of early
operations for the fourth statement of the ω program is (2,3,6,m,8), 2 ≤ m ≤ ∞ and
(2,n,6,m,8), 4 ≤ n ≤ 5 and 0 ≤ m ≤ ∞.
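
The bounds derived in this example can be checked with a brute-force Python sketch. It is
an illustration only and makes assumptions that the text does not: loop counters are
enumerated over a finite range instead of 0 to ∞, templates are tuples whose string entries
name loop counters, and the function name early_instances is hypothetical.

```python
from itertools import product


def early_instances(template, ranges, beta_time, current_time):
    """Enumerate virtual times of one statement that fall in [beta_time, current_time).

    template: tuple mixing integers and loop-counter names, e.g. (2, "j", 2, 0, 0).
    ranges:   dict mapping each counter name to a finite iterable of values
              (an assumed stand-in for the unbounded range 0..infinity).
    """
    names = [c for c in template if isinstance(c, str)]
    result = []
    for values in product(*(ranges[n] for n in names)):
        env = dict(zip(names, values))
        vt = tuple(env[c] if isinstance(c, str) else c for c in template)
        # Keep times at or after the beta time and before the processor's current time.
        if beta_time <= vt < current_time:
            result.append(vt)
    return result


beta = (2, 3, 6, 2, 6)      # beta time of the example
current = (2, 6, 3, 1, 4)   # current time of processor 1
counters = {"j": range(10), "k": range(10)}
first = early_instances((2, "j", 2, 0, 0), counters, beta, current)
second = early_instances((2, "j", 3, 1, 4), counters, beta, current)
third = early_instances((2, "j", 6, "k", 6), counters, beta, current)
```

Under these assumptions the first statement yields j from 4 to 6 and the second yields j
from 4 to 5, in agreement with the bounds given in the text. A real debugger would derive
the bounds symbolically rather than by enumeration.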


FIGURE 4-4  β program and ω program

β program:

1         rep a = 0 to _np[0] with <a>
2  <0>      for (j = 0; j < n; j++) {
3             dist a = 0 to _np[0] with <a>
4  <0>          b = 1;
5             rep a = 0 to _np[0] with <a>
6  <0>          for (k = 0; k < m; k++)
7                 rep a = 0 to _np[0] with <a>
8                   A[k] = b;
9           }

ω program:

2,j,2,0,0     for (j = 0; j < n; j++) {
2,j,3,a,4       b = 1;
2,j,6,k,6       for (k = 0; k < 10; k++)
2,j,6,k,8         A[k] = b;
              }


4.3.4.2 Examining and modifying variables

If we allow the user to examine or modify a variable when there are early or late operations,
we must observe the same dependence rules that a compiler must observe when
scheduling code. These rules determine when operations may be reordered. We use the
terminology from data dependence [Padua 86] to phrase the constraints.

There is a flow dependence between two operations when one operation stores to a variable
and a later operation reads it, without any intervening stores. In the following
example, the second statement is flow dependent on the first statement because the first
statement stores into a and the second statement reads a.

    a = 1;
    b = a;


There is an output dependence between two operations when they store to the same
variable and there are no intervening stores. In the following example, there is an output
dependence between the first and second statement because they both store to the same
variable.

    b = 1;
    b = 2;

There is an anti dependence between two operations when one operation reads a variable
and a later one writes the same variable.

If the user wants to examine a variable, then there can be no late or early writes. Examining
a variable is treated the same as a read of the variable at the point in the program
corresponding to the β time. If there are late writes of the examined variable, then
allowing the user to look at its value would be the same as moving a read before a write,
which violates a flow dependence. If there is an early write, then we would be moving a
read after a write, which violates an anti dependence.

If the user wants to modify a variable, there can be no late or early operations which
read or write the variable. A late read violates an anti dependence, a late write violates
an output dependence, an early read violates a flow dependence, and an early write violates
an output dependence.
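
The six cases above follow mechanically from the three dependence definitions, as the
following Python sketch illustrates. The encoding is an assumption made for illustration:
a debugger examine is treated as a read and a modify as a write at the β time, and the
function names dependence and violated_dependence are hypothetical.

```python
def dependence(first, second):
    """Kind of dependence from an earlier access to a later access of one variable."""
    kinds = {
        ("write", "read"): "flow",
        ("write", "write"): "output",
        ("read", "write"): "anti",
    }
    return kinds.get((first, second))  # None for read-read: no dependence


def violated_dependence(debugger_access, program_access, timing):
    """Dependence that the debugger access would violate, or None.

    timing is "late" when the program access belongs before the beta time but has
    not executed yet, and "early" when it belongs after the beta time but already ran.
    """
    if timing == "late":
        # Source order: program access first, then the debugger access.
        return dependence(program_access, debugger_access)
    if timing == "early":
        # Source order: debugger access first, then the program access.
        return dependence(debugger_access, program_access)
    raise ValueError("timing must be 'late' or 'early'")
```

For example, examining a variable with an early write gives the pair (read, write) in
source order, which is an anti dependence, while an early read of an examined variable
gives (read, read) and violates nothing.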

In Section 4.3.4.1, we compute the set of operations that are early; we can use this information
to determine which variables have early reads or writes. If a statement has an
early execution, then the variables in that statement have early reads or writes. Since the
early execution information is conservative, the early read and write information is conservative
as well.

In the example in FIGURE 4-4, we concluded that processor 0 has no early operations,
thus it doesn't have any early reads or writes, and all variables on that processor can be
examined and modified at the current point in the execution of the program. Processor 1
has executed some early operations. The first statement has been executed early, so we
can conclude that the variables j and n have early reads and the variable j has an early
write. Because the second statement has been executed early, we can conclude that the
variable b has some early writes. The third statement has also been executed early, so
we can conclude that the variables k and b have early reads and the variable k and the
array A have some early writes. The only variable that doesn't have an early write is n
on processor 1, so that is the only variable that can be examined by the user. Every variable
on processor 1 has an early read, so none of them can be modified.
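
The conclusion for processor 1 can be reproduced with a small Python sketch. The set-based
representation and the function names can_examine and can_modify are assumptions made for
illustration; the early read and write sets are copied from the paragraph above, and late
accesses are empty because the β time is the lowest virtual time of all the processors.

```python
def can_examine(var, early_writes, late_writes=frozenset()):
    """Examine is a read at the beta time: disallow on any early or late write."""
    return var not in early_writes and var not in late_writes


def can_modify(var, early_reads, early_writes,
               late_reads=frozenset(), late_writes=frozenset()):
    """Modify is a write at the beta time: disallow on any early or late access."""
    return all(var not in accesses
               for accesses in (early_reads, early_writes, late_reads, late_writes))


# Early accesses on processor 1 in the FIGURE 4-4 example, as stated in the text.
early_reads = {"j", "n", "k", "b"}
early_writes = {"j", "b", "k", "A"}
variables = ["j", "n", "b", "k", "A"]

examinable = [v for v in variables if can_examine(v, early_writes)]
modifiable = [v for v in variables if can_modify(v, early_reads, early_writes)]
```

Because the early sets are conservative, a variable reported as not examinable here might
in fact be safe to examine; the check can only err on the side of disallowing a command.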

We have conservatively assumed that if a statement has an early read or write of an
array, then the entire array has an early read or write. If we have more information about
which element of an array each iteration of the loop references, we can compute finer
grain information about which elements of the array have early reads or writes. In the
example program, we know that iteration i of the loop in the third statement writes element
A[i]. If the current time of processor 1 is (2,3,6,4,6) and the time of processor 0 is
still (2,3,6,2,6) as it was in the previous example, then the β time is still (2,3,6,2,6). The
only early operations are of processor 1 on statement 3, which is (2,3,6,n,6) where
2 ≤ n ≤ 3. In this case, there are early writes of the array elements A[n] where
2 ≤ n ≤ 3.

4.3.4.3 Setting breakpoints

If a breakpoint is set on a statement that has early executions, then some of the breakpoints
that the program would have normally reached will be missed. The debugger cannot
be certain because the set of early operations is conservative; we only know that the
statement might have been executed early. If the user sets a breakpoint on a statement
with early executions, then the debugger should warn the user that some breakpoints
might have been missed. The debugger has enough information to tell the user which
iterations the missed breakpoints come from.

4.3.5 What to do when commands are disallowed

When a command is disallowed because some processors have executed early operations,
the user might be able to get the information that they need another way. They
might inspect another variable or set a breakpoint on another statement. However, in
some cases, they may have to inspect a particular variable. In this situation, they could
try to run their program again and hope that the next time it stops in a state that has the
information they need. Even if we run the program again, it might not stop in a better
state on successive tries.

There are several ways that a debugger can help with this problem. It can rerun the program
from the beginning and stop in a state with the same β time, but no early operations. The
debugger could also restrict execution of the parallel program so that when it stops at a
breakpoint, it will be in a state without any early operations. We call this a consistent
breakpoint because the program stops in a state that is consistent with a state in the execution
of the β program. Consistent breakpoints can sacrifice some or all of the parallelism
of the program but might cost less in the long run when compared to repeatedly
rerunning the program.

Both of these choices can be viewed as a special case of roll forward. The first is rolling
forward all the processors from the beginning of the program and stopping at a specified
virtual time, which in this case happens to be the current β time. The second is rolling
forward all processors to the virtual time of the next potential breakpoint.

We first introduce the mechanisms that are needed for roll forward in Section 4.3.5.1.
We then explain how roll forward can be used to do rerun in Section 4.3.5.2 and consistent
breakpoints in Section 4.3.5.3.

4.3.5.1 Roll forward

For roll forward, we want to advance the execution of all processors until they have executed
all statements with a virtual time less than some target time t. We shall call the
statement in the β program whose virtual time is the same as the target time the target
statement. To roll forward, we could just single step every processor and check the virtual
time after every step, stopping when the current time of a processor is greater than
or equal to t. However, this is very costly. Instead we would like to compute in advance
the statement that each processor should stop at, and set a breakpoint there. If the statement
is inside a loop, then the first time that the statement executes might not be the
time that we want to stop. The breakpoint must be conditioned on the virtual time being
greater than or equal to the target time. We shall call the statement and the virtual time
that a processor should stop at the least upper bound (LUB) because it is the lowest virtual
time of the processor that is greater than or equal to the target time. For any given
target time, each processor will have its own LUB.
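
The difference between the two ways of rolling forward can be sketched in Python. This is
a model under assumptions of our own, not the debugger's implementation: a processor is
represented by a hypothetical pre-recorded list of (virtual time, line) pairs, virtual
times are tuples compared lexicographically, and both function names are invented for the
illustration.

```python
def roll_forward_single_step(trace, start, target):
    """Step one statement at a time, checking the virtual time after every step."""
    for pc in range(start, len(trace)):
        if trace[pc][0] >= target:
            return pc
    return len(trace)


def roll_forward_breakpoint(trace, start, target, lub_line):
    """Stop only at the LUB statement, and only once its virtual time reaches the target.

    The virtual time test models the condition attached to the breakpoint; earlier
    executions of the same statement inside the loop do not stop the processor.
    """
    for pc in range(start, len(trace)):
        vt, line = trace[pc]
        if line == lub_line and vt >= target:
            return pc
    return len(trace)


# Assumed trace for one processor: five loop iterations over four statements.
trace = []
for i in range(5):
    trace += [((2, i, 0, 0, 0), 2), ((2, i, 3, 0, 0), 4),
              ((2, i, 5, 1, 6), 6), ((2, i, 12, 0, 0), 12)]
target = (2, 3, 5, 1, 6)
```

Both functions stop at the same place when the breakpoint is placed on the LUB statement,
but the second only has to evaluate its condition when that one statement is reached.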

We can compute the LUB for each processor by starting at the target statement in the β
program and tracing the control flow until we find one statement and virtual time for
each processor. Every time we find a PIL for a processor that does not yet have a LUB,
we record the virtual time and statement as the LUB for that processor, and then remove
the processor from the list of processors that do not have LUB's. This method of finding
LUB's assumes that the program executes the statement at the target time. If the control
flow never gets to the target statement, there is no guarantee that it will get to the LUB
either. The previous method that single steps until we reach the target time always
works.

The program in FIGURE 4-5 illustrates this procedure. The program at the top is a β
program and the program below it is the same β program with the rep and dist constructs
expanded. All of the following discussion refers to the expanded β program.

If the target time is (2,3,5,1,6), then we start the search for the LUB at the second
assignment to the variable a because only this statement can have the target time as a
virtual time. The statement is mapped to processor 1, so this is the LUB for processor 1.
If we trace the execution of the program, we skip over the else clause and execute the
next if statement. The if statement is mapped to both processors, so this statement is
the LUB for processor 0. The virtual time of the LUB is in the same iteration as the target
time, (2,3,12,0,0). When tracing control flow, we never have to trace the body of an
if statement after checking the part of the statement that contains the condition. Recall
that the PIL's of the condition line of an if statement must be a superset of the PIL's of
the statements in the body. If we don't get a match on the PIL of the line with the condition,
we cannot get a match on any PIL in the body. This is important because we cannot
predict future behavior of the program, and cannot know whether to trace the then or
else clause when searching for the LUB.

For this example, when we want to roll forward to a target time of (2,3,5,1,6), the
debugger sets a breakpoint on line 6 on processor 1. When the value of the loop counter
i is 3 when executing this statement, the debugger stops the processor. For processor 0,
the debugger sets a breakpoint on line 12, and stops the processor when it reaches the
breakpoint and the value of i is 3.

Anotha  example  iHustraies  how  to  trace  oooiroi  with  loops.  If  tbe  target  virtual  time  is 
(2,3,16,1,17),  then  the  starting  point  in  the  program  is  the  last  statement  in  the  loop.  This 
statement  is  mapped  to  processor  1,  so  this  is  the  LUB  for  that  processor.  If  we  trace  tbe 
execution  of  tbe  progrtan,  we  follow  the  loop  back  path  to  the  beginning  of  tbe  loop. 
The  loop  heada  is  mapped  to  processor  0.  so  this  stateaoent  is  tbe  LUB  for  that  proces* 
soc  The  virtual  time  of  the  LUB  is  in  the  next  iteration  of  the  kxip,  (2,4.0,0.0).  Just  like 
the  If,  we  neva  have  to  trace  tiiroogb  the  body  of  a  loop  aftor  cbeckiag  the  beada 
because  the  PIL’s  of  tbe  taeada  are  a  superset  of  the  PIL’s  the  body.  If  we  do  not  get  a 


S2 


fiOUnce-iEVCL  OCBUCKIMtl  OF  Amo«U11Cm.Y  PAIUIXEUZED  PROCnAIM 


ParaM  Mneudon  whfo  dsbuggbig 

FIGURE 4-4   A β program and its expanded version.

Virtual time      β program

                   1        rep for m = 0 to 1 with <m>
(2,i,0,0,0)        2  <0>   for (i = 0; i < 10; i++) {
                   3          rep for m = 0 to 1 with <m>
(2,i,3,0,0)        4  <0>     if (a == 1) {
                   5            dist for n = 0 to 1 with <n>
(2,i,5,n,6)        6  <0>       a = 2;
                   7          } else {
                   8            dist for n = 0 to 1 with <n>
(2,i,8,n,9)        9  <0>       b = 2;
                  10          }
                  11          rep for m = 0 to 1 with <m>
(2,i,12,0,0)      12  <0>     if (e == 1) {
                  13            dist for n = 0 to 1 with <n>
(2,i,13,n,14)     14  <0>       c = 2;
                  15          } else {
                  16            dist for n = 0 to 1 with <n>
(2,i,16,n,17)     17  <0>       d = 2;
                  18          }
                  19        }

Virtual time      PIL        β program with expanded rep and dist

(2,i,0,0,0)       <0>,<1>    for (i = 0; i < 10; i++) {
(2,i,3,0,0)       <0>,<1>      if (a == 1) {
(2,i,5,0,6)       <0>            a = 2;
(2,i,5,1,6)       <1>            a = 2;
                               } else {
(2,i,8,0,9)       <0>            b = 2;
(2,i,8,1,9)       <1>            b = 2;
                               }
(2,i,12,0,0)      <0>,<1>      if (e == 1) {
(2,i,13,0,14)     <0>            c = 2;
(2,i,13,1,14)     <1>            c = 2;
                               } else {
(2,i,16,0,17)     <0>            d = 2;
(2,i,16,1,17)     <1>            d = 2;
                               }
                             }



match on any PIL of the header, we cannot get a match on any PIL in the body. This is
important because we cannot predict the future behavior of the loop, other than to know
that if the loop executes the body, it will always execute the header one more time to
check if it should execute the next iteration.

4.3.&.2  Rerun

For rerun, we want to put the program in a state where it has the same β time as the
current state, but there are no early operations. We can do this by starting the program from
the beginning and rolling to the β time.

When we rerun a program, the final state should be the same as the original state, except
that there should not be any early operations. The state is the same if the program is
determinate and the execution environment is reproduced. Code produced by thread
splitting is determinate, thus there cannot be any schedule dependent bugs that prevent
rerun from reaching the desired state. To reproduce the execution environment, external
I/O must be the same, memory should be initialized the same way, etc. Reproducing the
execution environment can be difficult. However, users typically design their programs
so that they can be rerun because debugging a program usually requires that the program
be run many times. A debugger can make this easier by initializing memory and
logging and playing back any external I/O [Pan 88].
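The logging and playback of external input mentioned above can be sketched as follows. This is only an illustration under assumed names: the class `ReplayableInput` and its `read` method are hypothetical and are not taken from the thesis or from [Pan 88].

```python
class ReplayableInput:
    """Record external input on the first run and replay it on reruns.

    Hypothetical helper for illustration only.
    """

    def __init__(self, read_fn, log=None):
        self.read_fn = read_fn                    # real source of external input
        self.log = [] if log is None else list(log)
        self.replaying = log is not None          # a supplied log means "rerun"
        self.pos = 0

    def read(self):
        if self.replaying:
            value = self.log[self.pos]            # hand back the logged value
            self.pos += 1
            return value
        value = self.read_fn()                    # first run: ask the environment
        self.log.append(value)
        return value


# First run: values come from the environment and are logged.
source = iter([7, 3, 9])
first = ReplayableInput(lambda: next(source))
original = [first.read() for _ in range(3)]

# Rerun: the environment is not consulted; the log is played back.
rerun = ReplayableInput(read_fn=None, log=first.log)
replayed = [rerun.read() for _ in range(3)]
```

Because the rerun sees exactly the logged values, a determinate program reaches the same state again, which is the property rerun relies on.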

4.3.&.3  Consistent breakpoints

When  users  set  breakpoints,  they  are  usually  interested  in  the  state  of  the  program  at  that 
point in the execution. Rather than letting the program run unrestricted, we can use a
consistent breakpoint where the debugger controls execution so that the program breaks
in a state that is consistent with a state of the β program. This can reduce the parallelism
of a program. We set breakpoints without specifying the iteration; if the breakpoint is set
on a statement inside a conditional, the program could stop at any iteration. If we want
to be sure that we have not executed any early iterations, then we must execute one
iteration at a time.

A  consistent  breakpoint  can  be  implemented  with  roll  forward.  For  each  breakpoint  set 
by the user, we compute the time of the next breakpoint. This can be done by finding the
smallest virtual time in the set of virtual times for a breakpoint statement that is greater
than or equal to the β time. For each processor, we must check if there are any early
operations, assuming that the time of the earliest breakpoint is the next β time. If there
are early operations for that β time, then we cannot have a consistent breakpoint. Next,
for every processor we compute one LUB for each breakpoint and roll forward to the
target time.
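The search for the time of the next breakpoint can be sketched as below. The sketch assumes that virtual times are compared lexicographically, component by component (which is how Python compares tuples); the function name `next_breakpoint_time` is hypothetical.

```python
def next_breakpoint_time(breakpoint_times, beta_time):
    """Return the smallest virtual time >= beta_time, or None if there is none.

    Assumption: virtual times are tuples ordered lexicographically.
    """
    candidates = [t for t in breakpoint_times if t >= beta_time]
    return min(candidates) if candidates else None


# Virtual times of the statement "d = 2" on processor 1 in FIGURE 4-4,
# one per iteration of the loop (i = 0..9).
times = [(2, i, 16, 1, 17) for i in range(10)]

same_iteration = next_breakpoint_time(times, (2, 3, 12, 0, 0))
next_iteration = next_breakpoint_time(times, (2, 3, 16, 1, 18))
past_the_end = next_breakpoint_time(times, (2, 10, 0, 0, 0))
```

With a β time in iteration 3 before the statement, the next breakpoint time is in iteration 3; once the β time has passed the statement, it moves to iteration 4.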

If the breakpoint is in a conditional or a loop, then the breakpoint may or may not occur,
depending on whether the program executes the body of the conditional or loop. If the
breakpoint does occur, then the roll forward mechanism leaves the program in a state
where there are no early operations. If the breakpoint does not occur, then the other
processors may or may not stop at the LUB for the breakpoint since we only guarantee that
a processor will reach a LUB if the program reaches the target statement.

If a processor stops at the LUB for a breakpoint and the breakpoint does not occur, then
we must restart the processor. The difficulty is knowing when it is safe to restart. If any
processor runs past the target virtual time then the breakpoint will not occur. If any
processors are stopped at the LUB and other processors are still running, we must poll the
running processors for their virtual time until they all stop at the LUB or one of the
processors runs past its LUB. If a processor runs past its LUB, we start over. The
debugger stops all processors, computes the earliest virtual time for each breakpoint and the
new LUB's, and continues the roll forward.
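The polling decision described above can be sketched as a single function. This is an interpretation, not the thesis's implementation: it assumes that "running past the LUB" means the processor's virtual time is lexicographically greater than the virtual time of its LUB, and the names `poll_step`, `virtual_times`, `stopped`, and `lubs` are hypothetical.

```python
def poll_step(virtual_times, stopped, lubs):
    """Decide what the debugger does after one poll of the processors.

    virtual_times[p]: current virtual time of processor p (a tuple)
    stopped[p]:       True if processor p has stopped at its LUB
    lubs[p]:          virtual time of the LUB computed for processor p
    """
    if any(virtual_times[p] > lubs[p] for p in lubs):
        return "start over"      # some processor ran past its LUB
    if all(stopped[p] for p in lubs):
        return "breakpoint"      # every processor stopped at its LUB
    return "poll again"          # some processors are still running


lubs = {0: (2, 4, 0, 0, 0), 1: (2, 3, 16, 1, 17)}

still_running = poll_step(
    {0: (2, 3, 12, 0, 0), 1: (2, 3, 16, 1, 17)}, {0: False, 1: True}, lubs)
all_stopped = poll_step(
    {0: (2, 4, 0, 0, 0), 1: (2, 3, 16, 1, 17)}, {0: True, 1: True}, lubs)
ran_past = poll_step(
    {0: (2, 4, 3, 0, 0), 1: (2, 3, 16, 1, 17)}, {0: False, 1: True}, lubs)
```

On "start over" the debugger would stop all processors, recompute the earliest breakpoint time and the LUB's, and continue the roll forward, as the text describes.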

If we need to break at every iteration of a parallel loop, then there is little parallelism
that can be exploited in that part of the program. However, there may be parallelism
earlier in the program. Consistent breakpoints allow the user to take advantage of
parallelism in parts of the program that the user does not care about, while avoiding the need for
rerun. In the worst case, using consistent breakpoints is comparable to executing with a
sequential schedule.


4.4  Efficiency 


A debugger with dynamic order restoration must do extra work when compared to a
conventional debugger for sequential programs; however, we believe that the additional
work is small enough to make dynamic order restoration practical. In this section we
examine why there is extra work and estimate the cost. There are two main sources of
additional work for the debugger when compared to a debugger for sequential
programs: the information that must be computed to decide if commands are disallowed
and re-execution when commands are disallowed. We discuss each of them below.

During normal execution, no additional work is necessary. When a program stops
executing because of a breakpoint, exception, or interrupt, the debugger first computes the
virtual time of each processor to determine the current time. Determining the virtual
time takes a constant cost per processor. After the current time is set, the debugger rolls
forward all processors with late operations. The debugger must compute the LUB for
each processor when rolling forward. Computing the LUB of a processor is linear in the
size of the program. Rolling forward takes time, but since useful work is being
executed, we do not consider this extra. After roll forward is complete, the debugger
computes the set of early operations on each processor, which is linear in the program
length. From the set of early operations, the debugger can compute the set of variables
with early reads and writes, which is also linear in the program length. Computing early
reads and writes with a resolution of individual array elements can require some data
flow analysis. However, this information can be precomputed at compile-time. All the
work described above must be done every time the program stops and is linear in
program length and the number of processors. If either the program size or the number of
processors is so large that the time is significant, we could use the parallel processor to
compute the information; the problem can be partitioned by dividing the program or
dividing by processor.

If a command is disallowed, the user may decide to re-execute the program. If the
program is re-executed frequently, the cost can be significant; in some cases, it could be
better to execute the program with a sequential schedule so that commands are never
disallowed. The number of times it is necessary to re-execute a program depends on
what variables are examined and modified and where breakpoints are set, and is thus
dependent on the user. We believe that the expense of re-execution will in general be




small, especially when it is compared to the cost of executing with a sequential
schedule.

First, if a program has a factor of n speedup when executing on the parallel machine, the
debugger would have to rerun the program n times before execution would be as slow as
sequential execution. For massively parallel machines, n can be quite high.
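The break-even point can be made explicit. As a sketch, assume (this is a simplification not stated in the text) that every rerun costs one full parallel execution, that a sequential execution takes time $T_{seq}$, and that the speedup is exactly $n$, so one parallel execution takes $T_{seq}/n$. Then $k$ parallel executions are no slower than one sequential execution exactly when:

```latex
k \cdot \frac{T_{seq}}{n} \le T_{seq} \iff k \le n
```

For example, under these assumptions a program with a speedup of 64 could be run up to 64 times on the parallel machine before the total time reached that of a single sequential run.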

Second, after re-executing the program once, the program is in a state without early
operations, so no commands are disallowed in that state and re-execution is not needed
again until the user continues execution.

Third, the user can use consistent breakpoints to ensure that re-execution is never
necessary. As explained in the previous section, if we set a consistent breakpoint in a parallel
loop the program executes sequentially, so any breakpoint, exception or interrupt would
put the program in a state without any early operations. In the worst case, the user can
use consistent breakpoints to execute the program with a sequential schedule.


4.5  Related  work 


Gupta studies the problem of providing source level debugging for programs that have
been trace scheduled for VLIW (Very Long Instruction Word) machines [Gupta 88].
Since a VLIW can execute many operations in the same instruction word, it can be
viewed as a tightly synchronized parallel machine. Our work focuses on the dynamic
nature of execution on MIMD machines. In a MIMD machine, the processors are not
tightly coupled and the relative order of execution of two operations on different
processors can change, which creates the need for dynamic order restoration. It is unnecessary
for a VLIW because the parallelism is statically scheduled. Gupta's work focused on
static reordering of operations, which we do not consider in this thesis. However, we
can apply his results to the problem of structural mapping. For this reason, Gupta's work
on the structural problem is complementary.

Pineo and Soffa study source-level debugging for automatically parallelized code
[Pineo 91]. They only attempt to solve the problem of presenting the correct data values
to the user; they do not consider modifying data or control flow.

Their method exploits the features of single assignment languages. Over the lifetime of
a program, a variable will have many versions, one version for each time it is assigned a
value. They make the versions explicit in the program by converting the imperative,
sequential program to a sequential program in single assignment form. Every time a
variable is assigned a value in the imperative program, a new variable with a unique
name is created in the single assignment program. If a variable is assigned a value in a
loop, scalar expansion is applied so that every iteration will assign a different variable.
If the bounds of a loop are not known before the loop begins, then space must be
incrementally allocated for variables during the execution of the loop. Each position in the
source program has associated with it the versions of all the variables that are live at that
point. Only one version of each variable can be live at one point in the program. After
this conversion, the program is parallelized.
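The renaming step of the conversion can be sketched for straight-line code as follows. This is a simplified illustration, not Pineo and Soffa's algorithm: it handles only a list of assignments, it does not perform the scalar expansion needed for loops, and the function name and the `name_k` versioning scheme are hypothetical.

```python
import re


def to_single_assignment(statements):
    """Give every assignment a fresh variable version (a -> a_1, a_2, ...).

    statements: list of (target, expression) pairs for straight-line code.
    Uses inside an expression refer to the latest version of each variable.
    """
    version = {}
    result = []
    for target, expr in statements:
        def latest(match):
            name = match.group(0)
            return f"{name}_{version[name]}" if name in version else name

        # Rewrite the uses first, so "a = a + b" reads the old version of a.
        new_expr = re.sub(r"[A-Za-z_]\w*", latest, expr)
        version[target] = version.get(target, 0) + 1
        result.append((f"{target}_{version[target]}", new_expr))
    return result


program = [("a", "1"), ("b", "a + 2"), ("a", "a + b")]
converted = to_single_assignment(program)
```

After the conversion each name is assigned exactly once, so the debugger's question changes from "is the current value the right one" to "which version is the right one", as the following paragraphs discuss.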




When debugging with our method, the debugger must determine if the current value of
the variable is the correct one to inspect for the current β time. In their method, every
variable is only assigned one value, so they must determine which copy of the variable
is the correct one for the current point in the program. To determine which copy, they
use the program counter in the source program and the value of loop counters in the
source program. In their system, they do not present the user with a model that only a
single loop iteration is being executed at one time, so the user must provide the loop
counter values when examining variables (e.g. examine variable a from iteration 5).

As Pineo points out, the single assignment transformation is not sufficient to solve the
problem on MIMD machines. Even if a variable can only be assigned to once, it still has
two values during the lifetime of a program: uninitialized and initialized. After the first
version of the variable in the source program has been initialized, it is still possible that
the copy that we want to inspect has not yet been initialized. To solve this problem,
Pineo proposes initializing all memory locations to 0 and when the debugger is asked to
display a variable that happens to be 0, the debugger reports that the value is 0 or the
value is not ready yet (the debugger does not have enough information to distinguish the
two cases). Alternatively, if the hardware supports detection of reading uninitialized
variables, this can be checked automatically. Our virtual time scheme eliminates the
need for special hardware and still can give precise answers about the values of
variables.

There are four key differences between Pineo's and our work. First, we provide control
flow that is consistent with the sequential program. Not providing source program
control flow places a greater burden on the user of understanding the parallel structure of
the program.

The second difference is that Pineo's method requires extra space for the storage of data
and a large change in the parallel code. The conversion to single assignment form can
require significant increases in space and computation solely to support debugging. To
reduce the space requirements, Pineo performs name reclamation to re-use storage for
multiple versions of the same variable when it isn't necessary to keep all the versions
around for debugging or parallelization. In benchmark programs, Pineo measures a 10%
increase in storage that can be attributed solely to debugging; however, the number can
be much higher for individual programs.

The third difference is that saving multiple versions of a variable ensures that a current
value of a variable can be found if we run the program long enough so that the value is
initialized. In our method, the current value might have been overwritten and can only
be found by rerunning the program. The single assignment method pessimistically uses
extra space and computation in case the user might need to see a value. We believe that
it is more effective to not do the extra work, and rerun the program when it is not
possible to provide a current value of a variable because it has been overwritten.

The fourth difference is that the single assignment method can support reordering of
stores on the same processor, while our method cannot. As described earlier, this
capability comes at a potentially large cost and the user must pay the cost even when there is
no reordering of stores. However, saving extra copies of variables to achieve the same
result could be added to our model if desired.




4.6  Summary

In this chapter, we define the thread splitting transformation and its D function. We
show that thread splitting is debuggable if a sequential schedule is used. In a sequential
schedule, we execute the statements of the parallel program in the same order as they are
executed by the β program. This ordering is determined by an assignment of virtual times to
statements. If we allow the program to execute in parallel, we may stop in a state
that is not consistent with a state of the β program. If we allow the user to examine or
modify variables in this state, the behavior of the debugger might be incorrect. The
debugger must detect if allowing the user to modify or examine a variable will cause
incorrect behavior.




CHAPTER  5 


Distribution transformations


In the distribution phase of compilation, code and data are assigned to processors for
execution. Distribution is accomplished in different ways on different machines. On
shared memory machines, the parallel DO can be implemented with fork/join
parallelism, where a master thread forks off other threads which each execute a subset of the
iterations [Stevens 90]. After all the iterations have been executed the threads join, so
that only the master thread continues execution. Compilers for distributed memory
machines usually use a different model, called SPMD, where the same program is
executed on all processors. Every processor executes the sequential code and when the
parallel loop is executed, every processor executes its assigned iterations [Tseng 89].
Synchronization is not done after every parallel loop; it is only done on an as-needed
basis. The thread splitting model was designed to support SPMD programs. The rest of
this chapter will assume SPMD type loop distribution.
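The SPMD model can be sketched by simulating what each processor does when it runs the same program. The block assignment of iterations used here is an assumption made for the example (the mappings actually used are introduced in Section 5.3), and the function name `spmd_program` is hypothetical.

```python
def spmd_program(my_id, num_procs, n):
    """The single program run by every processor; returns a trace of its work.

    Assumption for illustration: the n iterations of the parallel loop are
    assigned to processors in contiguous blocks.
    """
    trace = ["sequential code"]                  # replicated on every processor
    block = (n + num_procs - 1) // num_procs     # block size, rounded up
    lo = my_id * block
    hi = min(n, (my_id + 1) * block)
    for i in range(lo, hi):                      # only this processor's iterations
        trace.append(f"iteration {i}")
    return trace


# Simulate 4 processors running the same program on a 10-iteration loop.
traces = [spmd_program(p, 4, 10) for p in range(4)]
executed = sorted(int(e.split()[1]) for t in traces for e in t[1:])
```

Every processor executes the sequential part, while each loop iteration is executed by exactly one processor; no synchronization appears in the sketch because, as noted above, it is only inserted when needed.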

The general problem of deriving the β program for a particular distribution can be more
difficult than generating the parallel code itself. The β program (the β program is the
step before thread splitting) is essentially the parallel program combined into a single
thread of control. The order of operations in this thread of control must be compatible
with the order of operations in both the ca program and the a program. Furthermore, the
communication in the single thread must be ordered so that it will not deadlock
(receives must follow sends).

In this chapter, we describe how to derive the β program for a limited domain taken
from loop based parallelism. From a description of the distribution of the program we
can automatically derive the β code and its associated D function. To give some context
and demonstrate the relevance of the domain we have chosen, we briefly describe how a
compiler that takes advantage of loop-based parallelism can translate a loop nest into a
parallel program in Section 5.1. We describe the domain of source programs in Section
5.2. In Section 5.3, we introduce the basic method for distributing iterations among
processors. This is sufficient to do the block mappings employed by most compilers.
Section 5.4 describes how communication can be inserted into the program, and Section 5.5
presents the basic method for distributing data. In Section 5.6, we extend the basic
method for distribution of iterations and data so that we can support cyclic and
block-cyclic distributions of iterations and data as well.


5.1  Parallelizing compilers

For distributed memory machines, compilers usually distribute the data and then use the
location of the data to determine where to perform the computation. One commonly
used strategy is the owner computes rule, which assigns the computation to the
processor that owns the variable where the result is to be stored. In this case, the distribution of
iterations matches the distribution of data.

Two important factors for determining how to distribute the data are locality and load
balancing. A program has good locality when the data items that a processor needs are
usually on that processor. If the processor must execute a statement that needs a value
from another processor, then that value must be fetched by the processor. The better the
locality, the less time spent shipping data around, which leads to good performance. The
other factor is load balancing: if the work of a program is not evenly distributed, then
some processors will be under-utilized and the performance will be poor. Locality and
load balancing often place conflicting constraints on the mapping of data and
computation. If all the data and computation were placed on one processor, then no
communication would be necessary at run time. However, because one processor is doing all the
work, the load balancing is not good. The more computation is distributed, the better the
load balancing, but the locality becomes worse, which leads to more communication.

Nested loops with regular access patterns are good candidates for parallelization for
distributed memory machines because the compiler can trade off locality and load
balancing at compile time. Algorithms for determining the best distribution are beyond the
scope of this thesis, but have been studied extensively in the literature [Gupta
92][Wholey 91][Li 91][Knobe 90].

The example in FIGURE 5-1 illustrates some of the constraints for deciding how to
distribute data and computation and the different ways for mapping data to processors. In
the loop, iteration 0 stores its result into a[1], so if a[1] resides on processor 1, then
we want to execute iteration 0 on processor 1. Iteration 0 also accesses array elements
b[0] and b[1], so we want to put those elements on processor 1 as well. Essentially,
when we distribute the array b across the processor array, we shift it to the right by one
and assign two elements per processor. Assigning location based on locality is called
alignment. In this example the array b is aligned to the array a so that no run time
communication is necessary to execute an iteration of the loop. The domain examined in this
chapter allows us to express mappings similar to those necessary for the example in
FIGURE 5-1.




FIGURE 5-1   The placement of data and iterations for a loop

for (i = 0; i < 10; i++)
    a[i+1] = b[2*i] + b[2*i+1];

Processor 0      Processor 1      Processor 2

a[0]             a[1]             a[2]
                 b[0], b[1]       b[2], b[3]
                 iteration 0      iteration 1
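The placement in FIGURE 5-1 can be checked with a short sketch. The three functions below are hypothetical names that encode the placement described in the text: a[k] on processor k, the array b shifted right by one with two elements per processor, and the owner computes rule for choosing where each iteration runs.

```python
def owner_of_a(k):
    """Placement read off FIGURE 5-1: a[k] resides on processor k."""
    return k


def owner_of_b(k):
    """b is shifted right by one, with two elements per processor."""
    return k // 2 + 1


def processor_for_iteration(i):
    """Owner computes: iteration i runs where its result a[i+1] is stored."""
    return owner_of_a(i + 1)


# Collect every read of b that would need a value from another processor.
remote_reads = [
    (i, k)
    for i in range(10)
    for k in (2 * i, 2 * i + 1)
    if owner_of_b(k) != processor_for_iteration(i)
]
```

The list of remote reads is empty, which is the sense in which aligning b to a removes run time communication for this loop.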

5.2  Domain description


In our domain, source programs are limited to nested loops. The iterations of a loop may
be distributed or replicated across a set of processors or all executed on the same
processor. We will use the C syntax of for loops. However, loops where iterations are
distributed must be of the following form:

for (counter = low; counter < high; counter += step)

For loop control, low, high, and step may be expressions but may not change value once
the loop has begun. The body of the loop may contain any statements in the language,
including conditionals and loops, but may not change the loop counter. This is
essentially the semantic equivalent of a FORTRAN do loop.

To specify the parallelizing transformation, the compiler must determine the mapping of
loop iterations to processors, the description of inter-processor communication, and the
decomposition of data. To simplify the presentation, the generation of β code will be
presented as a three step process: distributing iterations, inserting communication, and
distributing data.

The first step, iteration distribution, restructures the loop and adds processor index
labels so that the appropriate loop iterations will be performed on each processor.
Communication statements are inserted into the loop in the second step. Finally, global
names and addresses for distributed data are replaced with local names in the third step.

Code generation and debugging procedures are described for each step. The final output
will be the β program and the debugging function for the transformation between the a
program and the β program.




5.3  Basic loop iteration distribution


In this section, we introduce the basic method for distributing loop iterations. More
complicated distributions build upon this model.

The distribution of loop iterations to processors will be described as a mapping between
elements of the iteration space and the processor space. The mapping is determined by
the compiler. An iteration space for a q deep loop nest is a q dimensional space.
Each point in the space corresponds to one loop iteration; the values of the loop counters
for that iteration are the coordinates of the point. FIGURE 5-2 shows a doubly nested
loop and its corresponding 2 dimensional iteration space. The processor space for an r
dimensional processor array is an r dimensional space. Each point corresponds to
one processor. The mapping assigns a set of iterations to be executed on each processor.


FIGURE 5-2   Iteration space of a doubly nested loop

for (i = 0; i < 6; i++) {
    for (j = 0; j <= i; j++) {
    }
}

[Diagram of the 2 dimensional iteration space; axis label: i]


Each processor executes its assigned iterations in the same order that they were
executed in the original program.

We only consider a limited class of linear mappings. A linear mapping from a q
dimensional iteration space to an r dimensional processor space can be described with T, an
r×q matrix, and o, an r element vector. The T matrix and o vector are chosen by the
compiler, which decides the mapping. The mapping is computed by:

Equation 5.1    $p = \lfloor T(i) + o \rfloor$




where i is a point in the iteration space (iteration index) and the result p is a location in
the processor space (processor index). Since processor indexes are composed of
integers, the result of T(i) + o is always rounded down (floor function). Mappings that
result in a processor index that does not correspond to an actual processor index are
invalid and cannot occur if the compiler is correct.
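Equation 5.1 can be sketched directly in code. The function name `map_iteration` and the sample matrices and offsets below are hypothetical values chosen for illustration; exact rational arithmetic is used so that the floor is not disturbed by floating point rounding.

```python
from fractions import Fraction
from math import floor


def map_iteration(T, o, i):
    """Processor index p = floor(T i + o), computed one component at a time.

    T: r x q matrix (list of rows), o: r element offset, i: q element index.
    """
    return tuple(
        floor(sum(Fraction(t) * x for t, x in zip(row, i)) + Fraction(off))
        for row, off in zip(T, o)
    )


# Identity matrix with a hypothetical non-zero offset.
shifted = map_iteration([[1, 0], [0, 1]], [1, 2], (3, 2))

# Fractional coefficients place several iterations on one processor.
coarse = map_iteration([[Fraction(1, 2), 0], [0, Fraction(1, 3)]], [0, 0], (5, 4))

# A negative coefficient reverses the direction of the mapping.
reversed_rows = map_iteration([[-1, 0], [0, 1]], [5, 0], (2, 3))
```

Each component of the result is rounded down independently, which matches the floor applied to the whole vector in Equation 5.1.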

Mappings will further be restricted so that in every row and column of T at most one
element is non-zero. In effect the iteration space is divided into slices that are parallel to
one axis of the iteration space. These slices are distributed along a dimension of the
processor space. The dotted lines in FIGURE 5-2 show one possible way of slicing the
iteration space. Our notation permits more complicated linear mappings, for example
mappings that slice the iteration space with cuts that are not parallel to the axes.
Supporting more general linear mappings does not make debugging more difficult, but
complicates the generation of loop bounds for distributed loops; it is not considered in this
thesis.


When it is necessary to map the same loop iteration to more than one processor, an
element of T may also be the special symbol *. Arithmetic expressions that contain * will
be evaluated as follows: $\forall (c \neq 0): * \times (c) = *$, $* \times (0) = 0$,
$\frac{*}{c} = *$, $* \bmod n = *$, and $* + c = *$. If a component of a processor index
is computed to be *, then that component takes on all possible values for that dimension
of the array. For example, in a 2x5 array, a processor index with * as one component
represents every processor index that agrees with it in the other component.
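The treatment of the special symbol can be sketched as follows. Only the multiplication and addition rules are implemented; the names `STAR`, `star_mul`, `star_add`, and `expand` are hypothetical, and the index (1, *) in a 2x5 array is an example chosen here for illustration.

```python
from itertools import product

STAR = "*"   # stands for "all values along this dimension"


def star_mul(a, c):
    """* x c = * for c != 0, and * x 0 = 0; an ordinary product otherwise."""
    if a == STAR:
        return 0 if c == 0 else STAR
    return a * c


def star_add(a, c):
    """* + c = *; an ordinary sum otherwise."""
    return STAR if STAR in (a, c) else a + c


def expand(index, shape):
    """All concrete processor indexes denoted by an index that may contain *."""
    axes = [range(n) if x == STAR else [x] for x, n in zip(index, shape)]
    return list(product(*axes))


row_one = expand((1, STAR), (2, 5))
everything = expand((STAR, STAR), (2, 5))
```

Expanding (1, *) yields the five processors of row 1, so an iteration mapped to that index is executed by each of them.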


The mapping described above is sufficiently general to do the block distribution,
distribution by row, and distribution by column, which is used by most parallelizing
compilers. It does not include the cyclic mapping because it is non-linear. In Section 5.6, we
extend the basic iteration distribution to handle cyclic distributions.


5.3.1  Examples

In this section we will give examples of various iteration mappings and their
transformation matrices.


The first example is a mapping where the offset vector o is non-zero. The mapping in
FIGURE 5-3 converts an iteration index into a processor index by adding the offset
vector o. Each square represents one iteration; the coordinates of the square are the
processor index that executes the iteration. We have labeled the iterations on the corners
with their iteration indexes.


In this example, a row of processors executes one entire iteration of the outer loop.
Iterations of the inner loop are spread across each row. If we swap the columns of the
mapping matrix, then we would obtain a mapping that is the transpose of the one above;
columns execute outer loop iterations and inner loops are spread across each column.

FIGURE 5-3   Iteration mapping with an offset

source program

for (i = 0; i < 6; i++) {
    for (j = 0; j <= i; j++) {
    }
}

specification of mapping

assignment of iterations to processors

[Diagram: iterations placed on the processor array; axes are dimension 0 of processor
array and dimension 1 of processor array]

Sometimes it is desirable to map all the iterations of a loop to a single processor. The
mapping matrix in FIGURE 5-4 assigns each iteration of the outermost loop to a
different processor, but does not distribute the innermost loop.

If a loop is distributed across a dimension, but more than one iteration is mapped to the
same processor, then a matrix element with an absolute value less than 1 should be
used for the appropriate dimension. The matrix in FIGURE 5-5 maps 2 iterations of the
outer loop to each row of processors, and spreads the iterations of the inner loop so that
each processor gets 3 iterations of the inner loop for a total of 6 iterations. Processors
near the boundary of the iteration space will get fewer iterations. Using a coefficient
greater than 1 will spread iterations so that not every processor will have work to do.
For example, if the value 3 is used, then only every third processor will execute an
iteration.
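The effect of fractional and large coefficients can be checked with a small sketch. It assumes the processor index is computed as the floor of the mapping matrix times the iteration index plus the offset, and it assumes a triangular 6-iteration nest for illustration; the function name `processor_index` and the variable `gamma` are hypothetical, not taken from the thesis.

```python
from collections import defaultdict
from fractions import Fraction
from math import floor


def processor_index(gamma, offset, iteration):
    """Return floor(gamma * iteration + offset) as a tuple of processor coordinates."""
    return tuple(
        floor(sum(g * i for g, i in zip(row, iteration)) + o)
        for row, o in zip(gamma, offset)
    )


# Assumed mapping: 2 outer iterations per processor row, 3 inner iterations per column.
gamma = [[Fraction(1, 2), 0], [0, Fraction(1, 3)]]
offset = [0, 0]

# Group the iterations of an assumed triangular nest by the processor that runs them.
assignment = defaultdict(list)
for i in range(6):
    for j in range(i + 1):
        assignment[processor_index(gamma, offset, (i, j))].append((i, j))

# With a coefficient of 3, consecutive iterations land on every third processor.
sparse = [processor_index([[3]], [0], (i,))[0] for i in range(4)]
```

Interior processors receive the full 2 x 3 block, while processors on the diagonal boundary of the triangular iteration space receive fewer iterations.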

When a negative coefficient is used, the iterations with successively increasing loop
counter values can be mapped to processors with successively decreasing processor
indexes. A * in Γ will result in a processor index that contains a *. In the example of
FIGURE 5-6, the outer loop is distributed by row and every column in the row executes
the entire inner loop for that row.

FIGURE 5-4  [Caption and matrix not recoverable. Surviving diagram labels: (5,0)-(5,5)
at the top and (0,0) at the bottom.]

FIGURE 5-5  Mapping that places more than one iteration on the same processor

    Γ = [ 1/2   0  ]
        [  0   1/3 ]

[Diagram labels: (5,3)-(5,5) and (4,3),(4,4); (3,3); (1,0),(1,1) and (0,0)]


FIGURE 5-6  Mapping when iterations are duplicated on several processors

[Diagram: processor array with x running from 0 to 7 and y from 0 to 5. Every
processor in row x executes the whole inner loop for that row: (0,0); (1,0)-(1,1);
(2,0)-(2,2); (3,0)-(3,3); (4,0)-(4,4); (5,0)-(5,5).]


5.3.2  Code generation

The code generation problem is to take a mapping and a loop nest and generate a β
program with a processor index labelling. The labelling should be chosen so that when
thread splitting is applied, the output will be a parallel program where each processor
executes the appropriate set of iterations. Furthermore, the order of execution of
iterations in the β program must be the same as it was in the original program. If the
compiler wants to change the order of execution of iterations on a single processor, then
that can be done as a part of the restructuring phase.

Each row of the Γ matrix determines how iterations are mapped to one dimension of the
processor array. If an element in column c of row r is neither a zero nor a *, then the
iterations of the cth loop in the nest are distributed across dimension r of the processor
array. If all the elements of row r are 0, then loop iterations are not distributed across
dimension r of the processor array. If an element in the row is a *, then iterations are
replicated on that dimension of the processor array.

Algorithm 5.1 takes a mapping and source code and outputs the β program. The
algorithm is divided into 4 phases, as marked. Phase 1 analyzes the Γ matrix to decide which
loops are distributed, replicated, and not distributed.
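The row classification that phase 1 performs can be sketched as follows. This is only an illustration under stated assumptions: the matrix is a list of rows whose entries are numbers or the string "*", and the function name `classify_rows` and the returned dictionary layout are hypothetical.

```python
def classify_rows(gamma):
    """Sketch of phase 1: sort processor dimensions into NDIST, REP and DIST."""
    depth = len(gamma[0]) if gamma else 0
    ndist, rep, dist = set(), set(), set()
    looptype, looppdim, loopscale = {}, {}, {}
    for r, row in enumerate(gamma):
        if all(e == 0 for e in row):
            ndist.add(r)          # nothing is distributed over dimension r
            continue
        for c, e in enumerate(row):
            if e == "*":          # loop c is replicated over dimension r
                rep.add(r)
                looppdim[c] = r
                looptype[c] = "rep"
            elif e != 0:          # loop c is distributed over dimension r
                dist.add(r)
                looppdim[c] = r
                loopscale[c] = e
                looptype[c] = "dist"
    for c in range(depth):
        looptype.setdefault(c, "ndist")  # loops that are neither
    return {"NDIST": ndist, "REP": rep, "DIST": dist,
            "looptype": looptype, "looppdim": looppdim, "loopscale": loopscale}
```

For an all-zero matrix both dimensions end up in NDIST, while the identity matrix places both dimensions in DIST with a scale factor of 1.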

Phase 2 handles the case when work is not distributed across a dimension of the
processor space. An example is when iterations are only assigned to one column of processors
in a 2 dimensional array; work is not distributed across the columns of the processor
space. We use conditionals to prevent the processors that are not assigned any iterations
from executing. We first emit a dist to distribute copies of the code across the
dimension. Then an if statement is used to restrict execution to the appropriate processors.




Phase 3 of the algorithm emits each loop of the nest. If that loop is distributed, a dist
is emitted that distributes the loop code across the appropriate dimension. Iterations are
distributed from low values to high values if there is a positive value in loopscale and
are distributed from high values to low values if there is a negative value in loopscale.
For some programs, it is not known at compile-time whether the loop should go from
low to high or high to low. An example is where the sign of the step for a loop is
unknown at compile-time. If the sign is not known, then we must emit two loops, one
that distributes from low to high and another that distributes from high to low. A
conditional selects which copy should be executed at run-time. The restriction that PIL's be
constant forces us to emit two loops for these situations. The reason for constant PIL's
and the effect of their limitations are described in CHAPTER 7. For the rest of this
chapter, we assume that we know at compile-time whether loops should be distributed
from low-valued processors to high-valued processors or vice versa.

If the compiler emits a loop in phase 3 that is distributed, the bounds of the loop are
parameterized by processor index so that each instance created by the dist will
execute its assigned loop iterations. If the loop is not distributed, it is emitted unchanged.

Phase 4 of the algorithm emits the loop body. For every dimension of the processor array
for which iterations are replicated, we emit a rep. Then we emit the body itself, with a PIL
of 0.

The auxiliary functions comp_slice_begin and comp_slice_end are used to compute the
first and last iterations that a processor should execute in a distributed loop.


Algorithm 5.1  Generate β program for a mapping matrix and offset.

Input:

    depth - depth of loop nest
    dim - dimension of processor space
    Γ[dim][depth] - mapping matrix
    o[dim] - offset vector
    counter[depth] - names of loop counters of nest, where
        counter[0] is outermost loop
    upper[depth] - upper bound expressions for loop nest
    lower[depth] - lower bound expressions for loop nest
    step[depth] - step size expressions for loop nest
    body - body of loop nest

Output: β loop nest

Intermediate storage:

    looptype[depth] - the type of loop: dist if the loop is distributed, rep if the loop is
        replicated, and ndist if it is neither
    looppdim[depth] - dimension of processor array that the loop is distributed over
    loopscale[depth] - scale factor if loop is distributed
    DIST - iterations are distributed across dimension d if d ∈ DIST
    REP - iterations are replicated across dimension d if d ∈ REP
    NDIST - iterations are not distributed across dimension d if d ∈ NDIST




Method:

phase1:

    NDIST = REP = DIST = ∅;
    foreach row r of Γ do
        if r is all zeros then
            NDIST = NDIST ∪ r;
        endif
        if element c of row r is a '*' then
            REP = REP ∪ r;
            looppdim[c] = r;
            looptype[c] = rep;
        endif
        if element c of row r is a number n then
            DIST = DIST ∪ r;
            looppdim[c] = r;
            loopscale[c] = n;
            looptype[c] = dist;
        endif
    enddo

phase2:

    /* for dimensions of the processor array that don't have distributed or replicated
       loops, emit a dist to distribute everything across the dimension, and then emit
       an if so that only the correct processors execute work */
    if NDIST ≠ ∅ then
        foreach d ∈ NDIST
            emit a dist for dimension d
        foreach d ∈ DIST ∪ REP
            emit rep for dimension d

        /* build a condition string, with one condition per member of NDIST */
        foreach d ∈ NDIST
            condition = condition ∪ "and _id[d] == o[d]";
        emit if statement with PIL of 0 and condition as the condition
    endif

phase3:

    /* each iteration emits one loop of nest */
    for l = 0 to depth-1 {

        /* emit the dist if the loop is distributed */
        if looptype[l] == dist then
            emit dist that distributes its body across dimension looppdim[l]
            DIST = DIST - looppdim[l];
        endif

        /* for all the dimensions of the processor array that don't have dists yet,
           emit a rep */
        foreach d ∈ REP ∪ DIST
            emit a rep for dimension d;

        /* now emit the loop header */
        if looptype[l] == dist then
            emit loop with PIL of 0
            loop lower bound is:
                comp_slice_begin(_id[looppdim[l]], lower[l], loopscale[l], offset[l]);
            new upper bound is:
                comp_slice_end(_id[looppdim[l]], upper[l], loopscale[l], offset[l]);
            step size is unchanged
        else
            loop is not distributed, emit it unchanged;
    }

phase4:

    /* loops are done, now do the body */
    foreach statement s ∈ body {
        foreach d ∈ REP
            emit a rep for dimension d;
        emit s with PIL of 0
    }

End of Algorithm

Auxiliary functions:

    comp_slice_begin(slice, lower, c, offset):
        max(lower, (slice - offset)/c)
    comp_slice_end(slice, upper, c, offset):
        min(upper, (slice + 1 - offset)/c - 1)
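The two auxiliary functions can be written out as a runnable sketch. Several things here are assumptions rather than statements from the thesis: the coefficient c is positive, the upper bound is treated as the last iteration (inclusive), and divisions are rounded up with a ceiling so that fractional and large coefficients give integer bounds. The trailing underscore in `slice_` only avoids shadowing a Python builtin.

```python
from fractions import Fraction
from math import ceil


def comp_slice_begin(slice_, lower, c, offset):
    """First iteration i >= lower with floor(c * i + offset) == slice_ (c > 0 assumed)."""
    return max(lower, ceil(Fraction(slice_ - offset) / c))


def comp_slice_end(slice_, upper, c, offset):
    """Last iteration i <= upper with floor(c * i + offset) == slice_ (c > 0 assumed)."""
    return min(upper, ceil(Fraction(slice_ + 1 - offset) / c) - 1)


def slice_iterations(slice_, lower, upper, c, offset):
    """Iterations executed by one processor; empty when begin exceeds end."""
    begin = comp_slice_begin(slice_, lower, c, offset)
    end = comp_slice_end(slice_, upper, c, offset)
    return list(range(begin, end + 1))
```

With c = 1/2 and a zero offset, processor 1 of a loop running from 0 to 5 is given iterations 2 and 3; with c = 3, processor 1 is given no iterations at all.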

5.3.2.1  Examples

To illustrate how the algorithm works, we use three examples, one for each of the types
of loops: distributed, replicated, and not distributed. All of the examples share the
same α program, which is the two deep loop nest found at the top of FIGURE 5-7. The
program is mapped to a 2 dimensional processor array.


In part (a) of FIGURE 5-7, we start with a loop which is neither distributed nor
replicated. All the iterations are executed on a single processor, the one selected by the
offset o = (2, 1). After phase 1 of Algorithm 5.1, NDIST contains 0 and 1 while REP and
DIST are empty. In phase 2, we emit a dist for dimensions 0 and 1, but don't emit any
rep's. We then emit an if with conditions for the two dimensions contained in NDIST.
These conditions restrict the rest of the program to executing on a single processor. In
phase 3, when we emit the loop, there are no members in DIST or REP so we don't emit
any rep's or more dist's. The loop and body are emitted unchanged.


In part (b) of FIGURE 5-7, we have an example of a loop where the outer and inner
loops are distributed. After phase 1, DIST has dimensions 0 and 1 in it and NDIST and
REP are empty. Because NDIST is empty, we can skip phase 2. For phase 3, we first
emit a dist for dimension 0, then remove 0 from DIST. We then emit a rep for
dimension 1, followed by the outer loop. On the second iteration of the loop in phase 3, we
emit a dist for dimension 1 and remove it from DIST. We don't emit any rep's, and
then we emit the loop. In phase 4, we emit the body.

FIGURE 5-7  Code generation examples for Algorithm 5.1

α program

    for (i = m; i < n; i++)
        for (j = 0; j < 2; j++)
            a[i][j] = 1;

Mapping and β programs

(a) Code generation for a mapping without distributed and replicated loops

          dist _a = 0 to _np[0] with <a,0>
          dist _b = 0 to _np[1] with <b,0>
    <0,0> if (_id[0] == 2 & _id[1] == 1)
    <0,0>     for (i = m; i < n; i++)
    <0,0>         for (j = 0; j < 2; j++)
    <0,0>             a[i][j] = 1;

(b) Code generation for a mapping with distributed loops

    Γ = [ 1  0 ]
        [ 0  1 ]

          dist _a = 0 to _np[0] with <a,0>
          rep _b = 0 to _np[1] with <0,b>
    <0,0> for (i = comp_slice_begin(_id[0],m,1,0);
               i < comp_slice_end(_id[0],n,1,0);
               i++)
          dist _c = 0 to _np[1] with <0,c>
    <0,0>     for (j = comp_slice_begin(_id[1],0,1,0);
                   j < comp_slice_end(_id[1],2,1,0);
                   j++)
    <0,0>         a[i][j] = 1;

(c) Code generation for a mapping with replicated loops

          rep _a = 0 to _np[0] with <a,0>
          rep _b = 0 to _np[1] with <b,0>
    <0,0> for (i = m; i < n; i++)
          rep _a = 0 to _np[0] with <a,0>
          rep _b = 0 to _np[1] with <b,0>
    <0,0>     for (j = 0; j < 2; j++)
          rep _a = 0 to _np[0] with <a,0>
          rep _b = 0 to _np[1] with <b,0>
    <0,0>         a[i][j] = 1;

In part (c) we have an example of a loop that is replicated on all processors. After phase
1, REP contains 0 and 1, DIST is empty, and NDIST is empty. Since NDIST is empty,
we skip phase 2. In phase 3, for the first iteration of the loop, we emit rep's for the 0 and
1 dimensions and then emit the outer loop unchanged. In the second iteration, we do the
same thing. In phase 4, we emit rep's for the 0 and 1 dimensions, followed by the body.

5.3.3  Debugging

With respect to debugging, iteration distribution mainly alters the control flow. The
control flow is changed by the addition of extra code to ensure that every processor
executes its assigned iterations. We discuss control flow first, and then describe data
structures afterwards.

For the run and where commands, the D function must make it appear that the control
flow of the target program is the same as the control flow of the source program. When
executing the target program, the debugger must skip the extra statements that are
executed in the β program but do not correspond to statements executed in the source
program. We can skip statements by repeatedly single stepping until we are past them,
which can be done if the compiler-inserted statement cannot cause exceptions.

There are two types of statements that the β program executes that do not correspond to
statements in the α program. The first are if statements inserted by phase 2 of
Algorithm 5.1. These can be easily recognized. The second type are called inactive loops.
Inactive loops are assigned iterations to execute that are all below the low bound or all
above the high bound of the original loop in the α program.

To explain why inactive loops occur, we use the example in FIGURE 5-8. In this figure,
we have expanded the program from part (b) of FIGURE 5-7 assuming a 2x2 processor
array. A processor in row i executes iteration i of the outermost loop. If the low bound of
the outer loop has a value of 1, then processors in row 0 do not execute any iterations. In
this case, the loop on line 1 is an inactive loop. If the high bound of the outer loop is 0,
the processors in row 1 do not execute any iterations. In this case, the loop on line 12 is
an inactive loop. An inactive loop is detected by comparing the low and high bounds for
the original loop to the low and high bounds for a distributed loop. If the bounds for a
loop are completely outside the original bounds, then the loop is inactive. From the
example, if the original low bound (m) is 1 and the high bound (n) is 1, then the loop on
line 1 is inactive because its high bound is 0. Being inactive is not a static attribute; the
status can change every time a loop is executed.
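The bounds comparison described above can be sketched as a small predicate. The sketch assumes inclusive low and high bounds, a positive scale factor, and ceiling rounding for the slice boundaries; the names `assigned_range` and `is_inactive` are hypothetical.

```python
from fractions import Fraction
from math import ceil


def assigned_range(slice_, c=1, offset=0):
    """First and last iteration that the mapping assigns to processor slice_ (c > 0)."""
    first = ceil(Fraction(slice_ - offset) / c)
    last = ceil(Fraction(slice_ + 1 - offset) / c) - 1
    return first, last


def is_inactive(slice_, lower, upper, c=1, offset=0):
    """True when every assigned iteration lies below lower or above upper."""
    first, last = assigned_range(slice_, c, offset)
    return last < lower or first > upper
```

Because the test depends on the run-time values of the original bounds, a debugger following this approach would have to re-evaluate it each time the loop is entered.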

For data, we exclude modifying loop counters while debugging. Changing a loop
counter of a distributed loop would require that we also change the current position in
the program. For example, if we set the loop counter to iteration 1, we would also have
to change the current position to the loop which executes iteration 1 (using the Γ matrix
to compute which statement).




FIGURE 5-8  β program from part (b) of FIGURE 5-7 expanded for a 2x2 processor array

     1  <0,0>,<0,1>  for (i = comp_slice_begin(_id[0],m,1,0);
     2                    i < comp_slice_end(_id[0],n,1,0);
     3                    i++)
     4  <0,0>            for (j = comp_slice_begin(_id[1],0,1,0);
     5                        j < comp_slice_end(_id[1],2,1,0);
     6                        j++)
     7  <0,0>                a[i][j] = 1;
     8  <0,1>            for (j = comp_slice_begin(_id[1],0,1,0);
     9                        j < comp_slice_end(_id[1],2,1,0);
    10                        j++)
    11  <0,1>                a[i][j] = 1;
    12  <1,0>,<1,1>  for (i = comp_slice_begin(_id[0],m,1,0);
    13                    i < comp_slice_end(_id[0],n,1,0);
    14                    i++)
    15  <1,0>            for (j = comp_slice_begin(_id[1],0,1,0);
    16                        j < comp_slice_end(_id[1],2,1,0);
    17                        j++)
    18  <1,0>                a[i][j] = 1;
    19  <1,1>            for (j = comp_slice_begin(_id[1],0,1,0);
    20                        j < comp_slice_end(_id[1],2,1,0);
    21                        j++)
    22  <1,1>                a[i][j] = 1;


The D function for iteration distribution can be found below. The run command takes a
list of statements in the breakpoint list and sets a breakpoint for every copy of the
statement (there is one copy on each processor). It then runs the program. If the program
stops at an inserted statement or an inactive loop, then it single steps until it reaches a
statement that is neither. Single stepping over loops that are inactive may not be possible if
the execution of the loop control causes an exception. The debugger must pick the first
processor that does not execute an inactive loop to report to the user as the current line.
This case is not included in the D function. The where command simply peels the
processor index off the label when reporting the current location. The set command
prevents the user from modifying a loop counter and the examine command just
passes the command through to the lower level debugger.




D_IterationDistribution(dstack, command, state, program, bpts, name, value)
{
    newbpts = ∅
    if (command == run) {
        foreach label l ∈ bpts
            foreach processor index p
                newbpts = newbpts ∪ (p,l)
        newstate = apply(first(dstack), rest(dstack), run, state, program, newbpts, ⊥, ⊥)
        if we stop at an inserted statement or an inactive loop
            single step until we reach a statement that is neither
        return newstate
    }
    if (command == where) {
        w = apply(first(dstack), rest(dstack), where, state, ⊥, ⊥, ⊥, ⊥)
        peel the processor index off w and return just the label
    }
    if (command == examine) {
        return apply(first(dstack), rest(dstack), examine, state, ⊥, ⊥, name, ⊥);
    }
    if (command == set) {
        if (name is a loop counter for a distributed loop)
            set is not allowed
        else return apply(first(dstack), rest(dstack), set, state, ⊥, ⊥, name, value);
    }
}
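The breakpoint expansion and label peeling performed by this D function can be sketched in isolation. The representation is an assumption made for illustration: a location is a (processor index, label) pair, the lower-level debugger is replaced by a plain list of stop locations, and all three function names are hypothetical.

```python
def expand_breakpoints(bpts, processor_indexes):
    """run: set a breakpoint on every per-processor copy of each labelled statement."""
    return {(p, label) for label in bpts for p in processor_indexes}


def peel_processor_index(location):
    """where: drop the processor index and report only the source label."""
    _processor, label = location
    return label


def first_visible_stop(stops, is_hidden):
    """run: step past inserted statements and inactive loops (those marked hidden)."""
    for location in stops:
        if not is_hidden(location):
            return location
    return None
```

A set command in this style would first check whether the named variable is the counter of a distributed loop and refuse the request in that case.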


5.4  Communication 

In this section, we describe how the compiler can insert communication into the parallel
program and how it affects the debugger. Communication between processors is
necessary when one processor must access data that is local to another processor. Because we
only study the problem of finding bugs in the user program, communication has little
impact on debugging. Send and receive operations are not visible to the user in the
original sequential program and thus communication code can be ignored by the
debugger. We assume that the compiler inserts communication into the program correctly and
that communication never fails.

The only difficulty lies in code generation of the β program; because there is only one
thread of control, sends and receives must be inserted into the β program so that a
receive is always executed after the corresponding send. This requirement can
complicate the structure of some β programs.

The compiler specifies communication by giving the position in the loop nest and the
code to perform the communication. Sometimes it is only necessary to communicate at
the beginning or end of processing a block of iterations rather than on a per-iteration
level. Since the blocking of iterations is not visible in the source code, the compiler
cannot name a position to insert the code if the action is only to be performed once per
block. For a per-block action, the compiler must give the nesting level of the loop and
whether the action is to be performed before or after processing of a slice.




A communication action is performed on every processor on which its iteration is
performed; if the iteration is replicated across a dimension, so is the communication.

If a variable reference in a loop body is non-local, then that value must be received from
another processor during the course of the computation. The variable reference must be
replaced with the name of the temporary in which the value is received. In addition to
the communication actions, the compiler also gives a set of variable references and the
expressions which replace them.

5.4.1  Code generation

send's and receive's are mapped to the same processors as the rest of the loop
iteration. In general, communication statements inherit the processor index label of the
surrounding iteration. Per-block communication is treated the same as other
communication. It is inserted inside the dist: before the loop if it precedes the block
and after the loop if it follows the block. In the body of the loop, the appropriate array
references are replaced with the substitute references.

5.4.2  Examples

In this section we give examples of common usages of communication in programs.
We show the user code and the generated code, and discuss how to demonstrate that the
program can still be executed sequentially after the communication is inserted.

One use of communication is to transmit values that are produced in one iteration and
consumed in a later one. If the iterations are mapped to different address spaces, then
the value must be sent from the processor that generates the value to the processor that
uses it. As an example, consider the following loop that sums the elements of an array:

    sum = 0;
    for (i = 0; i < n; i++) {
        sum = sum + C[i];
    }

There is a flow dependence: the value of sum used in the right hand side is the value
stored in the left hand side in the previous iteration. If the iteration distribution is
Γ = [1] and o = [0], successive iterations are mapped to successive processors and
the value of sum must be sent from processor to processor. Communication code would
be inserted as follows:




          rep a = 0 to _np-1 with <a>
    <0>   sum = 0;
          dist a = 0 to _np-1 with <a>
    <0>   for (i = comp_slice_begin(_id,0,1,0);
               i < comp_slice_end(_id,n,1,0);
               i++) {
    <0>       receiven((-1), sum)
    <0>       sum = sum + C[i]
    <0>       sendn((1), sum)
          }


Each cell receives sum from its left neighbor, adds to it, and passes it on to its right
neighbor. Recall from CHAPTER 2 that the semantics of receiven and sendn imply
that the first cell should not receive anything and the last cell should not send anything.
This communication is used to satisfy a flow dependence. Flow dependences always go
forward in virtual time because the use must occur after the generation of the value in
the β program.
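The pipelined sum can be simulated sequentially to see that every receive follows its send. The sketch below is a simplified model: one array element per processor, a dictionary standing in for the network, and the boundary rule (first cell receives nothing, last cell sends nothing) applied explicitly. The name `pipelined_sum` is hypothetical.

```python
def pipelined_sum(c):
    """Run the distributed sum in sequential order, one iteration per processor."""
    n = len(c)
    mailbox = {}            # destination processor -> value in flight
    local_sum = [0] * n     # replicated initialisation: sum = 0 on every processor
    for p in range(n):      # iteration p runs on processor p
        if p > 0:
            # receive from the left neighbour; a KeyError here would mean
            # that the receive was executed before the matching send
            local_sum[p] = mailbox.pop(p)
        local_sum[p] += c[p]
        if p < n - 1:
            mailbox[p + 1] = local_sum[p]   # send to the right neighbour
    return local_sum[-1] if n else 0
```

The final value lives only on the last processor, which mirrors the fact that sum is a distributed variable after the transformation.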

Another common communication usage is a broadcast operation. The code below is
mapped to a 1 dimensional processor array:

    for (i = 0; i < n; i++)
        c = b[i];

The iteration distribution is Γ = [*], together with its offset o (every processor executes
every iteration), and the array b is distributed. We explain data distribution in the next
section; assume that processor n has array element b[n]. Every processor executes every
iteration of the loop, thus it needs every element of the array b. For every iteration of
the i loop, the processor that has b[i] must broadcast its value to all the other
processors.


          rep a = 0 to _np[0] with <a>
    <0>   for (i = 0; i < n; i++) {
          rep a = 0 to _np[0] with <a>
    <0>       bcast(b[i], _id[0]==i, tmp);
          rep a = 0 to _np[0] with <a>
    <0>       c = tmp;
          }

In a broadcast, the selected processor sends out its copy of the data and all the other
processors must receive it. It is not necessary to make the actual send and receive
statements in a broadcast visible to the user who wrote the α program, so we use the bcast
function to hide the details of the broadcast operation. The first argument is the data
item to be broadcast, the second argument is the expression that is true on the processor
that is the source of the broadcast, and the third argument is the destination variable of
the broadcast. The reference to b[i] in the loop body is replaced with tmp. All the
sends and receives used to perform one bcast operation occur in a single iteration of
the loop, so we can be sure that there is a sequential execution of the program; the β
program does not deadlock.




In the beginning of Section 5.4, we mentioned that if a receive is executed before the
corresponding send in a β program, then an ω program that is deadlock free can deadlock
while debugging. This is because debugging can force the ω program to execute
operations in the same order as the β program, and if a receive is before its corresponding
send, the receive never completes. The extra synchronization that the debugger enforces
causes the deadlock.

We use the following example to illustrate how communication can incorrectly be
inserted into a program to create a negative time dependence.

    for (i = 0; i < n; i++)
        b[i] = a[i] + a[(i+1) % n]

If we were to use the iteration distribution of Γ = [1] and o = [0], then the program
after iteration distribution would be:

          dist a = 0 to _np-1 with <a>
    <0>   for (i = comp_slice_begin(_id,0,1,0);
               i < comp_slice_end(_id,n-1,1,0);
               i++)
    <0>       b[i] = a[i] + a[(i+1) % n]

Assume that the data is distributed so that processor i has array elements a[i] and b[i].
According to the iteration distribution, processor i is assigned to execute iteration i. To
execute its iteration, the processor already has a[i] and b[i] local, but must fetch
a[i+1] from an adjacent processor. We could insert communication into the program
as follows:

          dist a = 0 to _np-1 with <a>
    <0>   for (i = comp_slice_begin(_id,0,1,0);
               i < comp_slice_end(_id,n,1,0);
               i++) {
    <0>       send((-1), a[i+1]);
    <0>       receive((1), tmp);
    <0>       b[i] = a[i] + tmp;
          }

This β program deadlocks because iteration i receives a value from its neighbor to the
right. Its right neighbor executes iteration i + 1, so iteration i + 1 must execute the
send statement before iteration i can complete its receive statement. When
debugging, we must be able to execute the iterations in order; if we try to execute iteration i
without executing iteration i + 1, the program deadlocks. However, when executing the
ω program without the debugger, the program can complete without deadlock. The loop
iterations execute in parallel without any synchronization except whatever is forced by the
send and receive statements, so iteration i + 1 can execute before iteration i.

The correct way to write the β program for this example is to do the communication in a
separate loop before the computation loop. Since iteration i is assigned to processor i,
the computation loop is a left to right pass over the array. The communication loop
sends values from processor i + 1 to processor i, which is a right to left pass over the
array. Putting the communication and computation in separate loops resolves the
ordering conflict, but might result in a slightly less efficient program.
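The ordering conflict can be demonstrated with a small sequential replay of the communication operations. This is a simplified model under stated assumptions: one iteration per processor, sends never block, the wraparound element of the modulo subscript is ignored, and the function names are hypothetical.

```python
from collections import deque


def first_blocked_receive(schedule):
    """Replay (op, src, dst) operations in order; return the index of the first
    receive whose matching send has not yet been executed, or None."""
    channels = {}
    for step, (op, src, dst) in enumerate(schedule):
        if op == "send":
            channels.setdefault((src, dst), deque()).append(step)
        else:
            queue = channels.get((src, dst))
            if not queue:
                return step
            queue.popleft()
    return None


def communication_ops(order, n):
    """Each iteration i sends to processor i-1 and then receives from i+1."""
    ops = []
    for i in order:
        if i > 0:
            ops.append(("send", i, i - 1))
        if i < n - 1:
            ops.append(("recv", i + 1, i))
    return ops


n = 4
interleaved = communication_ops(range(n), n)             # left to right, inside the loop
separate_pass = communication_ops(reversed(range(n)), n)  # right to left, own loop
```

Replaying the left to right order blocks at the very first receive, while the right to left communication pass completes, which is the reason for splitting communication from computation.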

5.4.3  Debugging

Since the communication is not visible in the source program, the debugging function
ignores communication statements that have been inserted; when running, it skips over
them by repeatedly single stepping.

D_communication(dstack, command, state, program, bpts, name, value)
{
    if (command == run) {
        newstate = apply(first(dstack), rest(dstack), run, state, program, bpts, ⊥, ⊥)
        if (stopped at a communication statement)
            single step until we reach a statement
            that is not communication
        return newstate
    }
    if (command == where) {
        return apply(first(dstack), rest(dstack), where, state, ⊥, ⊥, ⊥, ⊥)
    }
    if (command == examine) {
        return apply(first(dstack), rest(dstack), examine, state, ⊥, ⊥, name, ⊥);
    }
    if (command == set) {
        return apply(first(dstack), rest(dstack), set, state, ⊥, ⊥, name, value);
    }
}


5.5  Data  distribution 


When we distribute iterations of loops to processors, we must also distribute the data
that those loop iterations access. Data distributions are specified in the same way loop
iteration distributions are specified; there is a linear mapping between the elements of
multidimensional data arrays and the multidimensional processor arrays. Data can be
replicated by using * in the mapping. As before, we limit the type of distribution to
the case where there is at most one non-zero element in every row and every column of
the distribution matrix. In effect, data arrays are decomposed along one or more
dimensions and distributed as slices of the original data array. Array references will be
represented by a vector of the subscripts for each dimension. For example, the reference
a[i][j] is:

    $\begin{bmatrix} i \\ j \end{bmatrix}$


The mapping of array elements to processors is identical to the mapping of iterations
to processor indexes. The processor index of an array element can be computed
from the array index as follows:

Equation 5.2    $p = \lfloor T(i) + o \rfloor$

where T is the transformation matrix and o is the offset. In the example in FIGURE 5-9,
an array is divided into 2x2 slices as determined by the T matrix. Each processor gets
one 2x2 slice.

FIGURE 5-9 Relationship between global and local array addresses
[Figure: an array divided into 2x2 slices, shown with its T matrix and o = (0, 0).]

A global address is the set of array indexes of an array element in the sequential
program; a local address is the set of array indexes used in one processor of the parallel
program. To convert the global address to a local address, we subtract a constant offset
from all addresses so that the location of the element in the slice with the smallest
address, which we shall call $i_0$, is translated to the origin. In the example of FIGURE 5-9,
we would subtract (2, 2) from an array reference on processor (1,1) to convert a global
address to a local address. For some types of mappings it is necessary to locate $i_0$ at
some point other than the origin. For example, we might want to leave row 0 unused so
that we can put a copy of a row there that has been mapped to an adjacent processor. We
call this offset s.




In general, the relationship between global and local addresses on a processor p is as
follows:

Equation 5.3    $l = g - i_0(p) + s$

where l is the local address, g is the global address, and $i_0(p)$ is the address of the array
element with the smallest lexicographic value that is on processor p.

$i_0$ can be computed from the mapping matrix T, offset o, and Equation 5.2. Given a p,
we want to find the minimum value of i such that $p = \lfloor T(i) + o \rfloor$. The value of $i_0$ is
found by computing $T'(p - o)$, where T' is found by inverting all the non-zero elements
of T, replacing all *s with 0, and then taking the transpose. For example:


r  = 

0  0* 
0  0  0 

r  == 

002 

000 

.1/2  0  0. 

POQ 

Note that we cannot simply solve Equation 5.2 for i because neither the floor function
nor T is necessarily invertible. If we substitute the value for $i_0$ in Equation 5.3 we
obtain:

Equation 5.4    $l = g - T'(p - o) + s$
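For a single array dimension that is cut into equal blocks, Equations 5.2 and 5.4 reduce to integer arithmetic. The sketch below assumes that the relevant entry of the mapping matrix is 1/block, that the offsets are integers, and that indexes are non-negative; the function names are made up for this illustration.

```c
/* Equation 5.2 for one dimension: floor(g/block + o), valid for g >= 0. */
int processor_index(int g, int block, int o)
{
    return g / block + o;
}

/* Equation 5.4 for one dimension: l = g - block*(p - o) + s.
 * Inverting the mapping entry 1/block gives the factor block. */
int local_index(int g, int p, int block, int o, int s)
{
    return g - block * (p - o) + s;
}
```

With blocks of 2 and no offsets, global index 3 lands on processor 1 at local index 1; choosing s = 1 shifts every local index up by one, which leaves local row 0 free for a copy of a neighbour's row.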

5.5.1 Code generation

When generating the P program, the indexes for references to distributed variables must
be rewritten to reflect their new location in memory. For the body of the loop nest, we
will assume that all array references to distributed variables will reference data elements
that are local to the processor. If it were necessary to access a non-local element, then
this would have been replaced with a reference to a local variable during the stage that
adds communication to a program.

5.5.1.1 Example

If the 3 dimensional array B were mapped to a 2 dimensional processor array as follows:




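We can use Equation 5.4 to convert the global address to a local address. If we substitute
the above values into Equation 5.4, we obtain:

    $l = g - T'(p - o) + s = g - \begin{bmatrix} 2 \times \mathtt{\_id}[1] - 2 \\ 3 \times \mathtt{\_id}[0] \\ 0 \end{bmatrix}$

This is used for any reference to the array B. The array subscript B[i][j+k][2]
would be remapped to:

    $l = \begin{bmatrix} i \\ j + k \\ 2 \end{bmatrix} - \begin{bmatrix} 2 \times \mathtt{\_id}[1] - 2 \\ 3 \times \mathtt{\_id}[0] \\ 0 \end{bmatrix} = \begin{bmatrix} i - 2 \times \mathtt{\_id}[1] + 2 \\ j + k - 3 \times \mathtt{\_id}[0] \\ 2 \end{bmatrix}$

which in the program would appear as:

    B[i-2*_id[1]+2][j+k-3*_id[0]][2]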


The more complicated addressing needed to reference distributed variables is not
necessarily inefficient. Note that the variable _id is loop invariant so the entire offset is loop
invariant and need not be computed inside the loop. Furthermore, every reference to the
same variable will have the same offset, so the offset must be computed at most once for
every distributed variable before the loop begins.
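The remapping of B[i][j+k][2] can be written so that the loop-invariant part is visibly separate from the per-iteration part. This is only a sketch: the struct and function names are invented here, and id0 and id1 stand for _id[0] and _id[1].

```c
/* Local subscripts of one array element. */
struct local3 { int x, y, z; };

/* Remap the source reference B[i][j+k][2] to local subscripts. */
struct local3 remap_B(int i, int j, int k, int id0, int id1)
{
    /* These two offsets depend only on the processor index, so a
     * compiler could compute them once before the loop begins. */
    int off0 = 2 * id1 - 2;
    int off1 = 3 * id0;

    struct local3 l = { i - off0, j + k - off1, 2 };
    return l;
}
```

Only the subtractions involving i, j and k remain inside the loop, which is why the per-iteration cost need not grow.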


5.5.2 Debugging

Distributing data does not directly change the structure of the program, so the control
flow is unchanged. If the user wants to examine or modify data, the debugger must
determine which processor has the data and the local address of the data on that processor.

The debugging function for the data distribution transformation is as follows:


Ddatadistribution(dstack, command, state, program, bpts, name, value)
{
    if (command == run) {
        return apply(first(dstack), rest(dstack), run, state, program, bpts, ⊥, ⊥)
    }

    if (command == where) {
        return apply(first(dstack), rest(dstack), where, state, ⊥, ⊥, ⊥, ⊥)
    }

    if (command == examine) {
        if (name is distributed)
            return apply(first(dstack), rest(dstack),
                         examine, state, ⊥, ⊥, global2local(name), ⊥);
        else { /* copy to look at is on same processor as the current statement */
            w = apply(first(dstack), rest(dstack), where, state, ⊥, ⊥, ⊥, ⊥)
            i = processor index of current statement w
            return apply(first(dstack), rest(dstack), examine, state, ⊥, ⊥, (name, i), ⊥);
        }
    }

    if (command == set) {
        if (name is distributed)
            return apply(first(dstack), rest(dstack),
                         set, state, ⊥, ⊥, global2local(name), value);
        else {
            foreach processor i { /* change all copies */
                state = apply(first(dstack), rest(dstack), set, state, ⊥, ⊥, (name, i), value);
            }
            return state
        }
    }
}

The function global2local uses Equation 5.2 to compute the processor number and
Equation 5.4 to convert the global address to a local address.

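5.5.2.1 Example

A three dimensional array A is distributed as follows on a 2 dimensional processor
array:

    $T = \begin{bmatrix} 0 & * & 0 \\ 0 & 0 & 1/2 \end{bmatrix}, \quad o = \begin{bmatrix} 0 \\ 1 \end{bmatrix}$

The user wants to inspect the variable A[1][2][11]. To compute the processor index:

    $p = \lfloor T(i) + o \rfloor = \left\lfloor \begin{bmatrix} 0 & * & 0 \\ 0 & 0 & 1/2 \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 11 \end{bmatrix} + \begin{bmatrix} 0 \\ 1 \end{bmatrix} \right\rfloor = \begin{bmatrix} * \\ 6 \end{bmatrix}$

To compute the local address:

    $l = g - T'(p - o) + s = \begin{bmatrix} 1 \\ 2 \\ 11 \end{bmatrix} - \begin{bmatrix} 0 \\ 0 \\ 10 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \\ 1 \end{bmatrix}$

The value can be found on any processor in column 6 with a local address of
A[1][2][1].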
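The arithmetic of this example can be checked with a small sketch of a global2local-style computation. It hard-codes one reading of the example, which is an assumption: only the third array dimension is distributed, with mapping entry 1/2 and offset 1 along the processor column dimension, and s = 0. The struct and function names are invented for the illustration.

```c
/* Where an element lives: processor column and local subscripts. */
struct placement {
    int proc_col;
    int local[3];
};

/* Assumed mapping: third subscript distributed with entry 1/2 and
 * offset o = 1; first two subscripts are not distributed; s = 0. */
struct placement global2local_example(const int g[3])
{
    struct placement r;
    r.proc_col = g[2] / 2 + 1;                   /* Equation 5.2 */
    r.local[0] = g[0];
    r.local[1] = g[1];
    r.local[2] = g[2] - 2 * (r.proc_col - 1);    /* Equation 5.4 */
    return r;
}
```

For the global address (1, 2, 11) this yields processor column 6, matching the text, and leaves the first two subscripts unchanged.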


5.6 Cyclic iteration distributions


A linear mapping model, as described in the previous sections, does not always perform
well when the work is not uniformly distributed across iterations. For example, the work
performed by one iteration of the outer loop of the nest in FIGURE 5-2 increases with
every iteration. If outer loop iterations are distributed evenly across a one dimensional
processor array, then the last processor does much more work than the first processor;
the load balancing is poor. Any linear model that parallelizes the outer loop must divide
the iterations equally among processors.

One mapping model that is used to better distribute the load for such a case is called a
cyclic distribution. In this model, a processor will execute every n iterations, where n is
usually the number of processors. On an n processor array, processor p would execute
iterations p, p+n, p+2n, .... It is also possible to cyclically distribute blocks of
iterations in the same way. If the number of iterations is sufficiently larger than the number
of processors, then each processor will have a nearly equal amount of work for triangular
problems. Cyclic distributions have disadvantages, however. If iteration n produces
a value that is used by iteration n+1, then the value must be sent from one processor to
another for a block size of 1. If a cyclic distribution were not used, both iterations could
be part of the same block and assigned to the same processor, reducing the frequency of
communication.

Cyclic distribution of loop iterations usually requires that we distribute the data arrays
in the same way. We will describe distributing iterations first and data second.
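The load-balancing argument can be made concrete by counting work per processor. The sketch below assumes a triangular nest in which outer iteration i costs i units of work, and that the number of processors divides the number of iterations; the function names are illustrative only.

```c
/* Work done by processor p under a block distribution: it executes
 * a contiguous run of total/np outer iterations. */
int work_block(int p, int np, int total)
{
    int per = total / np;            /* assumes np divides total */
    int w = 0;
    for (int i = p * per; i < (p + 1) * per; i++)
        w += i;                      /* iteration i costs i units */
    return w;
}

/* Work done by processor p under a cyclic distribution: it executes
 * iterations p, p+np, p+2np, ... */
int work_cyclic(int p, int np, int total)
{
    int w = 0;
    for (int i = p; i < total; i += np)
        w += i;
    return w;
}
```

With 6 iterations on 3 processors, the block distribution gives the first and last processors 1 and 9 units of work, while the cyclic distribution gives them 3 and 7; the spread narrows further as the iteration count grows relative to the processor count.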

5.6.1 Iteration distribution

We can extend the description of the mapping of iterations to processors to include
cyclic by rewriting Equation 5.1 as:

Equation 5.5    $p = \lfloor T(i) + o \rfloor \bmod a$

where mod is an elementwise modulo operation and a is a vector that contains the size
of the array in every dimension.

The algorithm for generating code and D functions as presented previously will no
longer work because the modulo function makes the mapping non-linear. Instead of
adapting the code generation and D functions to include cyclic distributions, we will do
a transformation on the source program before generating the P program, and then use
the standard code generation algorithm. The transformation changes the shape of the
iteration space in a manner that makes it possible to use a linear mapping to do the
cyclic distribution.

The user gives a T and o assuming Equation 5.5 will be used for the mapping. A
transformation is applied to the loop nest to generate another loop nest, a new T, a new o,
and a D function. The code generation algorithm (Algorithm 5.1) is applied to the new
loop nest, using the new T and o to specify the mapping.

5.6.1.1 Code generation

We can extend the mapping model to include the cyclic distribution by applying a loop
transformation called strip mining [Padua 86] to the original sequential program. An
example in FIGURE 5-10 is used to illustrate this. The left hand side of the figure is a
loop and its triangular iteration space. The i dimension is the vertical axis and the j
dimension is the axis coming out of the page. If we apply strip mining to the outer loop,
then that loop will be replaced with two loops. Inside the body of the loop, the value of
the original loop counter can be computed by summing the two new loop counters. The
new loop and iteration space is on the right hand side of the figure. The i dimension is
replaced with the tmp and newi dimensions, which are horizontal and vertical,
respectively.

After strip mining is applied, a linear mapping can be used to achieve the desired
assignment of iterations to processors. In FIGURE 5-11, a mapping of $T = [0 \;\; 1 \;\; 0]$ and
$o = [0]$ could be used to achieve a cyclic mapping of iterations to processors.


FIGURE 5-10 Triangular iteration space that is changed by strip mining

Original loop and iteration space:

    for (i = 0; i < 6; i++)
        for (j = 0; j < i; j++) ...

Loop and iteration space after strip mining:

    for (tmp = 0; tmp < 6; tmp += 2)
        for (newi = 0; newi < 2; newi++)
            for (j = 0; j < i; j++)
                (i = tmp + newi) ...

The general procedure for the cyclic distribution is as follows. The input is a loop nest
where the mth loop in the nest is of the form:

    for (k = l; k < h; k += s)

T and o are the parameters to the mapping. The mth column of T is the column that
describes the mapping of loop m. Since this loop is distributed, it has one non-zero
element c at position r. A new nest will be output which is the same except loop m is
replaced with the code below. The variables k, l, h, and s refer to the loop above.

    newl = ((((c*l+o[m])/_np[r])*_np[r])-o[m])/c;
    for (_t = newl; _t < h; _t += _np[r]/c)
        for (k = max(0, l-_t);
             k < _np[r]/c - max(0, _t+_np[r]/c - h);
             k += s)

The original loop counter can be reconstructed by adding _t and k. Uses of the loop
counter in the rest of the loop must be replaced with _t + k.
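The effect of the padded low bound and the two max conditions can be checked by running the strip-mined nest and recording every reconstructed counter. The sketch below is a simplification under stated assumptions: c = 1, o[m] = 0, stride s = 1, and l >= 0, so that the general bound expressions collapse to plain integer arithmetic. The function names are invented for the illustration.

```c
static int max_int(int a, int b) { return a > b ? a : b; }

/* Run the strip-mined form of
 *     for (k = l; k < h; k++)
 * with np processors, storing each reconstructed counter _t + k in out.
 * Returns the number of iterations executed. */
int strip_mined_visits(int l, int h, int np, int *out)
{
    int count = 0;
    int newl = (l / np) * np;   /* low bound padded down to a multiple of np */

    for (int t = newl; t < h; t += np)
        for (int k = max_int(0, l - t);            /* skip padding before l */
             k < np - max_int(0, t + np - h);      /* stop at the original h */
             k++)
            out[count++] = t + k;
    return count;
}
```

For l = 3, h = 11 and np = 4 the nest visits exactly 3 through 10 in order, so neither the padding at the start nor the partial strip at the end adds or drops iterations.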

The complicated loop control arises from the requirement that it be possible to use a
linear mapping for the new loop nest. For a linear mapping, the iterations of a loop may not
wrap around from the end of the processor array to the beginning; the assignment of
iterations to processors must be strictly monotonic, from low indexes to high indexes or
high indexes to low indexes. For the rest of this paragraph we assume that mappings
only go from low indexes to high indexes. To ensure that wrap around does not occur,
we must make sure that the iteration of the inner loop where k = 0 is always assigned
to the first processor in that dimension, which in some cases requires padding by adding
extra iterations on to the beginning of the original loop. The variable newl is the new
low bound that has been adjusted for padding. However, we do not want to really
execute these extra iterations, so we use the max condition in the low bound computation in
the inner loop to skip over those iterations. The max condition on the high bound of the
inner loop ensures that we do not execute any extra iterations after the original high
bound.

FIGURE 5-11 Mapping of the iteration space after strip mining


The new T matrix is the same as the old one except that an extra column of zeros is
inserted into the mth position; the mth column (the new one) is the mapping for the new
outer loop (it isn't distributed) and column m+1 determines the mapping of the inner
loop. The padding of iterations makes an offset unnecessary, so the o vector is the same
except that a 0 is used for the processor dimension that the loop is distributed along.

5.6.1.2 Debugging

When single stepping, the debugger must skip over the outer loop. Modifying a loop
counter is possible for this transformation, but since the iteration distribution which
follows does not permit it, there is no reason to do it for this transformation. If the user
examines the loop counter, the debugger simply returns the sum of the two new loop
counters.


5.6.2 Data distribution

Now we will describe how to do cyclic data distributions. It is very similar to cyclic
iteration distributions. We can extend the description of the mapping of data elements to
processors by rewriting Equation 5.2 as:

Equation 5.6    $p = \lfloor T(i) + o \rfloor \bmod a$

where mod is an elementwise modulo operation and a is a vector that contains the size
of the array in every dimension.

In the same way that we do a cyclic distribution of loop iterations by turning an n deep
loop nest into an n+1 deep loop nest, we will distribute data by turning an n dimensional
data array into an n+1 dimensional data array before we do the distribution. An example
is shown in FIGURE 5-12. The one dimensional array A with 8 elements has been
changed into a 2 dimensional array that is 4 rows by 2 columns. Some of the mappings
from source elements to target elements are denoted by arrows from one object to
another.


FIGURE 5-12 Converting a 1 dimensional array into a 2 dimensional array in preparation
for a cyclic distribution.


5.6.2.1 Code generation

All references to a distributed data array must be rewritten to reflect the new structure.
The general procedure is as follows. The input is an array reference where the mth index
in the reference is of the form: [e]. T and o are the parameters to the mapping. The
mth column of T describes the mapping of index m in the reference. Since this dimension
will be distributed, it has one non-zero element c at position r. A new reference
will be output which is the same except subscript m is replaced with the two subscripts
below:

    [(c*e+o[m])/_np[r]][e-(c*e+o[m])/_np[r]]

The new T matrix is the same as the old one except that an extra column of zeros is
inserted into the mth position; the mth column (the new one) is the mapping for the new
first index and column m+1 determines the mapping of the new second index. The
padding of data elements makes an offset unnecessary, so the o vector is the same
except that a 0 is used for the processor dimension that the array dimension is distributed
along. The s vector, which is used to compute the local address, is unchanged.
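The reshaping of a one dimensional array into rows of _np elements can be sketched with a quotient and remainder split. This is an assumption of the sketch rather than a transcription of the subscript expression printed above: it takes c = 1 and o = 0 and writes the second subscript as a remainder, which reproduces the 4 rows by 2 columns layout of FIGURE 5-12 for 8 elements on 2 processors.

```c
/* The two subscripts that replace a single subscript e. */
struct index2 { int row, col; };

/* Split index e for an array reshaped into rows of np elements
 * (assumes c = 1, o = 0 and e >= 0). */
struct index2 split_index(int e, int np)
{
    struct index2 r = { e / np, e % np };
    return r;
}
```

Under this split the column subscript is the one that a linear mapping can assign to processors, and the row subscript stays local.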

The address computation appears to be much more complicated than the original
reference, but this will not necessarily result in less efficient code. Array references in loops
like these are usually linear functions of loop counters, which can be generated
efficiently. The new indexes are linear functions of the old index so the new indexes will
still be linear functions of loop counters if the original index was one. In this case, it
might be more expensive to initiate the loop, but the cost per iteration to compute array
references need not increase.

5.6.2.2 Debugging

The D function for this transformation is straightforward; the control flow is not altered,
only array references are changed. When the user uses the debugger to examine or
modify a variable, it must map the source reference into a target reference using the
procedure described in the previous section.

5.7 Summary


This chapter introduces a notation for describing the translation from sequential to
parallel. There are three parts: the mapping of iterations to processors, the inter-processor
communication, and the mapping of data to processors. The description contains the
information that the debugger needs to generate the D function for the sequential
transformation and the P program. The notation is general enough to do the block and cyclic
distributions of data and iterations that most parallelizing compilers employ.




CHAPTER  6 


A  debugger  for  a  compiler 
with  block  and  cyclic 
distributions 


The previous two chapters describe the components for the compiler and debugger for
the thread splitting and distribution transformations. In this chapter, we integrate those
components and demonstrate how they work together for block and cyclic distributions.

We first introduce a matrix multiply program in Section 6.1, which is used as an
example throughout the chapter. In Section 6.2, we use matrix multiply to show a complete
translation for block and cyclic distributions. In Section 6.3, we describe
the information that is passed from the compiler to the debugger. We then show how
that information is used by describing some of the basic debugging operations in
Section 6.4. We conclude by examining some performance issues related to debugging in
Section 6.5.


6.1 Matrix multiply

The code for matrix multiply is depicted below. To simplify the code, we don't include
any initialization; we assume that the A, B, and C arrays are properly initialized
elsewhere. The program is a triple nested loop; each loop will be called by the name of its
loop counter: k, i, and j. The i and j loops have no dependences, so they can be
executed in parallel, but the k loop has a dependence.

    l1: for (k = 0; k < d; k++)
    l2:     for (i = 0; i < d; i++)
    l3:         for (j = 0; j < d; j++)
    l4:             C[i][j] += A[i][k]*B[k][j];

A compiler or user chooses the distribution of computation and data to minimize
communication while distributing work. The distribution determines what communication is
necessary. There are many ways to choose to parallelize matrix multiply; we only
describe one way. The parallelization decision is driven by the way the loops access
data. FIGURE 6-1 illustrates how matrix multiply accesses its data for the k=1 iteration
of the outer loop and the first few iterations of the i loop. Iteration m of the i loop only
accesses row m of the arrays C and A. If we distribute the i loop across processors and
divide C and A by rows so that row m is on the same processor that executes iteration m
of the i loop, then we never need to move the C or A arrays between processors.
Iteration n of the k loop accesses row n of the B array, so every time the program begins the
i loop every processor needs a copy of the same row of the B array. If the B array is
divided by rows, then the processor that has row n must broadcast it before beginning
the i loop.


FIGURE 6-1 Data accesses of matrix multiply
[Figure: the rows of C, A, and B accessed for k=1 and the first few iterations of the i loop.]


We start by describing the parallelization that uses a block distribution. The A, B, and C
arrays are distributed by row. Each processor gets 2 rows of each array (processor m
gets row 2m and 2m+1). All processors execute every iteration of the k loop. For each
iteration of the k loop, processor m executes iterations 2m and 2m+1 of the i loop.
Since the iterations and rows of A and C are distributed the same way, no communication
is needed for these arrays. Before a processor begins executing iteration m of the i
loop, it must fetch row n of the B array, where n is the current value of the loop counter
k. Each processor executes all the iterations of the j loop.

To motivate the steps in the translation, we first show the final parallel code; in the
next section we go through the steps of generating it.


    for (k = 0; k < d; k++) {
        for (t = 0; t < d; t++)
            bcast(_id == k/2, B[k-2*_id][t], tmp[t]);
        for (i = 2*_id; i <= 2*_id+1; i++)
            for (j = 0; j < d; j++)
                C[i-2*_id][j] += A[i-2*_id][k]*tmp[j];
    }


The bcast primitive implements a broadcast for a single element. The first argument is
true if the processor should broadcast the value, or false if the processor is a recipient
of the broadcast. The second argument is the value to be broadcast. This value is
ignored on the processors that are receiving the broadcast value. The third argument is
the location that should receive the broadcast value. It is ignored on the processor that
sends the broadcast value. The variable _id contains the processor index.
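The three-argument convention can be illustrated by simulating all processors inside one function. This is a sketch only: NP, sim_bcast and the array-per-argument layout are assumptions made for the illustration, not the primitive's real interface, and exactly one sender is assumed.

```c
#define NP 4   /* number of simulated processors (hypothetical) */

/* Simulated broadcast: sender[p], value[p] and dest[p] are the first,
 * second and third arguments as seen on processor p. */
void sim_bcast(const int sender[NP], const double value[NP], double dest[NP])
{
    double v = 0.0;

    for (int p = 0; p < NP; p++)
        if (sender[p])
            v = value[p];       /* the value is read only on the sender */

    for (int p = 0; p < NP; p++)
        if (!sender[p])
            dest[p] = v;        /* the destination is written only on receivers */
}
```

The receivers' second arguments and the sender's third argument are never touched, which mirrors the "ignored" cases in the description.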

The cyclic distribution is similar to the block distribution. The A, B, and C arrays are
again distributed by row. The difference is in the rows and iterations that are assigned to
each processor. Each processor gets 2 rows of each array (processor m gets row m and
m+p, where p is the number of processors). All processors execute every iteration of
the k loop. For each iteration of the k loop, processor m executes iterations m and m+p
of the i loop. Since the iterations and rows of A and C are distributed the same way, no
communication is needed for these arrays. Before a processor begins executing iteration
m of the i loop, it must fetch row n of the B array, where n is the current value of the
loop counter k. Each processor executes all the iterations of the j loop.

The final parallel code is as follows:


    for (k = 0; k < d; k++) {
        for (t = 0; t < d; t++)
            bcast(_id == k%_np, B[k%_np-_id][t], tmp[t]);
        for (t = 0; t < d; t += _np)
            for (i = 0; i < _np; i++)
                for (j = 0; j < d; j++)
                    C[t/_np][i][j] += A[t/_np][i][k]*tmp[j];
    }


The variable _np contains the number of processors. The difference between the two
programs is in the way local addresses are computed and the bounds for the i loop.
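The two row assignments can be compared side by side. The sketch below follows the prose description (block: processor m holds rows 2m and 2m+1; cyclic: processor m holds rows m and m+p); the struct and function names are invented for the illustration.

```c
/* Owning processor and local row number of a global row. */
struct home { int proc; int local_row; };

/* Block distribution with 2 rows per processor. */
struct home block_home(int row)
{
    struct home h = { row / 2, row - 2 * (row / 2) };
    return h;
}

/* Cyclic distribution over np processors. */
struct home cyclic_home(int row, int np)
{
    struct home h = { row % np, row / np };
    return h;
}
```

Row 5 is held by processor 2 as local row 1 under the block distribution, and by processor 1 as local row 1 under a cyclic distribution over 4 processors.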


6.2 Compilation

Now we show the compilation process for matrix multiply using block and cyclic
distributions. The steps are iteration distribution, inserting communication, data distribution,
and thread splitting. The resulting program is compiled by a single cell compiler which
generates the executable.

6.2.1 Block distribution

We start with the block distribution. The distribution is specified by the mapping of
iterations to processors, the mapping of data to processors, and the communication. We
start with the mapping of iterations to processors. We want to give each processor 2
iterations of the i loop, so the iteration distribution is: $T = [0 \;\; 1/2 \;\; 0]$, $o = [0]$. When we
apply this distribution to the program by using Algorithm 5.1, we obtain the following
program:


        rep a = 0 to _np-1 with <a>
    <0> for (k = 0; k < d; k++)
        dist b = 0 to _np-1 with <b>
    <0>     for (i = comp_slice_begin(_id, 0, 2, 0);
                 i <= comp_slice_end(_id, d, 2, 0);
                 i++)
    <0>         for (j = 0; j < d; j++)
    <0>             C[i][j] += A[i][k]*B[k][j];

In the next phase, the compiler inserts communication. As was described in the previous
section, for iteration n of the k loop, the processor that has row n of the B array must
broadcast that row to all processors before beginning the i loop. For this reason, the
compiler inserts a broadcast before the i loop. The references to the array B must be to the
local copy received, so the compiler also replaces references to B[k][j] in the loop
with tmp[j]. After communication is inserted, we have the following program:

        rep a = 0 to _np-1 with <a>
    <0> for (k = 0; k < d; k++) {
        rep a = 0 to _np-1 with <a>
    <0>     for (t = 0; t < d; t++)
        rep a = 0 to _np-1 with <a>
    <0>         bcast(2*_id == k, B[k][t], tmp[t]);
        dist b = 0 to _np-1 with <b>
    <0>     for (i = comp_slice_begin(_id, 0, 2, 0);
                 i <= comp_slice_end(_id, d, 2, 0);
                 i++)
    <0>         for (j = 0; j < d; j++)
    <0>             C[i][j] += A[i][k]*tmp[j];
        }

Next, the compiler converts global addresses to local addresses. Each of the arrays is
distributed so that each processor gets 2 rows. The mapping is specified by:

    A, B, C:  $T = [1/2 \;\; 0]$, $o = [0]$


The compiler uses the above specification and Equation 5.4 to compute the mapping
between global addresses and local addresses as follows. Arrays A, B, and C all have the
same mapping, so we only need to show the global to local mapping once.

    $T' = \begin{bmatrix} 2 \\ 0 \end{bmatrix}, \quad p = [\mathtt{\_id}]$

    $l = g - T'(p - o) = g - \begin{bmatrix} 2 \times \mathtt{\_id} \\ 0 \end{bmatrix}$

In the code we must subtract 2*_id from the first index of a global address. This gives
us the following program:


     1      rep a = 0 to _np-1 with <a>
     2  <0> for (k = 0; k < d; k++) {
     3      rep a = 0 to _np-1 with <a>
     4  <0>     for (t = 0; t < d; t++)
     5      rep a = 0 to _np-1 with <a>
     6  <0>         bcast(2*_id == k, B[k-2*_id][t], tmp[t]);
     7      dist b = 0 to _np-1 with <b>
     8  <0>     for (i = comp_slice_begin(_id, 0, 2, 0);
     9               i <= comp_slice_end(_id, d, 2, 0);
    10               i++)
    11  <0>         for (j = 0; j < d; j++)
    12  <0>             C[i-2*_id][j] +=
    13                      A[i-2*_id][k]*tmp[j];
    14      }


In the next step, the compiler applies thread splitting, which removes the PIUs and the
rep and dist constructs to yield:

     1
     2  for (k = 0; k < d; k++) {
     3
     4      for (t = 0; t < d; t++)
     5
     6          bcast(2*_id == k, B[k-2*_id][t], tmp[t]);
     7
     8      for (i = comp_slice_begin(_id, 0, 2, 0), ni = 0;
     9           i <= comp_slice_end(_id, d, 2, 0);
    10           i++, ni++)
    11          for (j = 0; j < d; j++)
    12              C[i-2*_id][j] +=
    13                  A[i-2*_id][k]*tmp[j];
    14  }

Recall that the thread splitting debugger we presented in the previous chapter assumes
that loops are normalized (start at 0 and count by 1). In this example, instead of
normalizing the i loop, we introduce a new variable called ni (normalized i) that starts at 0
and counts by 1 in the i loop. Whenever we need the loop count for the i loop, we
reference this variable instead of the loop counter i.

This program is then compiled by a single processor compiler that generates an
executable which can be loaded into every processor of the parallel machine.
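As a sanity check on the block distribution, the parallel program can be simulated on one machine and compared with the sequential source. The sketch below is a simplification under stated assumptions: processors are simulated by an outer loop over a processor index, the broadcast is replaced by a copy into a tmp row shared by all simulated processors, the sender is chosen as the processor with index k/2 as in the prose description, and NP, D and the function names are hypothetical.

```c
#define NP 2            /* simulated processors (hypothetical) */
#define D  (2 * NP)     /* matrix dimension: 2 rows per processor */

/* Sequential source program. */
void matmul_seq(double A[D][D], double B[D][D], double C[D][D])
{
    for (int k = 0; k < D; k++)
        for (int i = 0; i < D; i++)
            for (int j = 0; j < D; j++)
                C[i][j] += A[i][k] * B[k][j];
}

/* Block-distributed version; LA, LB, LC hold each processor's 2 local rows. */
void matmul_block(double LA[NP][2][D], double LB[NP][2][D], double LC[NP][2][D])
{
    double tmp[D];
    for (int k = 0; k < D; k++) {
        int owner = k / 2;                          /* processor holding row k of B */
        for (int t = 0; t < D; t++)
            tmp[t] = LB[owner][k - 2 * owner][t];   /* stands in for bcast */
        for (int id = 0; id < NP; id++)             /* each processor in turn */
            for (int i = 2 * id; i <= 2 * id + 1; i++)
                for (int j = 0; j < D; j++)
                    LC[id][i - 2 * id][j] += LA[id][i - 2 * id][k] * tmp[j];
    }
}

/* Returns 1 if both versions produce the same result on a small input. */
int block_matches_sequential(void)
{
    double A[D][D], B[D][D], C[D][D];
    double LA[NP][2][D], LB[NP][2][D], LC[NP][2][D];

    for (int i = 0; i < D; i++)
        for (int j = 0; j < D; j++) {
            A[i][j] = i + 1;
            B[i][j] = i - j;
            C[i][j] = 0.0;
            LA[i / 2][i % 2][j] = A[i][j];          /* global row i -> local row i - 2*id */
            LB[i / 2][i % 2][j] = B[i][j];
            LC[i / 2][i % 2][j] = 0.0;
        }

    matmul_seq(A, B, C);
    matmul_block(LA, LB, LC);

    for (int i = 0; i < D; i++)
        for (int j = 0; j < D; j++)
            if (C[i][j] != LC[i / 2][i % 2][j])
                return 0;
    return 1;
}
```

Because both versions accumulate over k in the same order and the inputs are small integers, the comparison can use exact equality.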


6.2.2 Cyclic distribution

We want to do a cyclic distribution for the i loop and the A and C arrays. The compiler
must strip mine the loop and the arrays before applying the basic distribution. After strip
mining, the code is as listed below. We have simplified array index and loop index
expressions when possible.

    for (k = 0; k < d; k++)
        for (_t = 0; _t < d; _t += _np)
            for (i = 0; i < _np; i++)
                for (j = 0; j < d; j++)
                    C[_t/_np][i][j] +=
                        A[_t/_np][i][k]*B[k][j];


After strip mining, the distribution for the loops is $T = [0\ 0\ 1\ 0]$, $o = 0$. The
distribution for the A and C arrays is identical: $T = [0\ 1\ 0]$, $o = 0$, $s = [0\ 0\ 0]$. The
distribution for the B array is the same as in the previous example: $T = [2\ 0]$.


Next, the compiler applies Algorithm 5.1 to obtain the following program:




rep a = 0 to _np-1 with <a>
<0> for (k = 0; k < d; k++)
rep a = 0 to _np-1 with <a>
<0>   for (_t = 0; _t < d; _t += _np)
dist b = 0 to _np-1 with <b>
<0>     for (i = comp_slice_begin(_id,0,1,0);
             i <= comp_slice_end(_id,_np,1,0);
             i++)
<0>       for (j = 0; j < d; j++)
<0>         C[_t/_np][i][j] +=
              A[_t/_np][i][k]*B[k][j];

In the next phase, the compiler inserts communication. The broadcast occurs before the
_t loop executes. In the source program, the compiler decides to insert the broadcast
after the k loop. The references to the array B must be to the local copy received, so the
compiler also replaces references to B[k][j] in the loop with tmp[j]. After
communication is inserted, we have the following program:

rep a = 0 to _np-1 with <a>
<0> for (k = 0; k < d; k++) {
rep a = 0 to _np-1 with <a>
<0>   for (t = 0; t < d; t++)
rep a = 0 to _np-1 with <a>
<0>     bcast(_id%_np==k, B[k][t], tmp[t]);
rep a = 0 to _np-1 with <a>
<0>   for (_t = 0; _t < d; _t += _np)
dist b = 0 to _np-1 with <b>
<0>     for (i = comp_slice_begin(_id,0,1,0);
             i <= comp_slice_end(_id,_np,1,0);
             i++)
<0>       for (j = 0; j < d; j++)
<0>         C[_t/_np][i][j] +=
              A[_t/_np][i][k]*tmp[j];
    }

Next, the compiler converts global addresses to local addresses. Array B has the same
mapping as before, while A and C share a different mapping, so we only need to show
the global to local mapping once.




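\[ T = \begin{bmatrix} 0 & 1 & 0 \end{bmatrix}, \qquad o = 0, \qquad p = \texttt{\_id} \]

\[ l = g - T^{T}(p - o) + s
     = g - \begin{bmatrix} 0 \\ \texttt{\_id} \\ 0 \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}
     = g - \begin{bmatrix} 0 \\ \texttt{\_id} \\ 0 \end{bmatrix} \]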


In the code we must subtract _id from the second index of a global address. This gives
us the following program:


 1  rep a = 0 to _np-1 with <a>
 2  <0> for (k = 0; k < d; k++) {
 3  rep a = 0 to _np-1 with <a>
 4  <0>   for (t = 0; t < d; t++)
 5  rep a = 0 to _np-1 with <a>
 6  <0>     bcast(_id%_np==k, B[k][t-2*_id], tmp[t]);
 7  rep a = 0 to _np-1 with <a>
 8  <0>   for (_t = 0; _t < d; _t += _np)
 9  dist b = 0 to _np-1 with <b>
10  <0>     for (i = comp_slice_begin(_id,0,1,0);
11               i <= comp_slice_end(_id,_np,1,0);
12               i++)
13  <0>       for (j = 0; j < d; j++)
14  <0>         C[_t/_np][i-_id][j] +=
15                A[_t/_np][i-_id][k]*tmp[j];
16      }

In the next step, the compiler applies thread splitting, which removes the PILs and the
rep and dist constructs to yield:




 1
 2  for (k = 0; k < d; k++) {
 3
 4    for (t = 0; t < d; t++)
 5
 6      bcast(_id%_np==k, B[k][t-2*_id], tmp[t]);
 7
 8    for (_t = 0; _t < d; _t += _np)
 9
10      for (i = comp_slice_begin(_id,0,1,0), ni=0;
11           i <= comp_slice_end(_id,_np,1,0);
12           i++, ni++)
13        for (j = 0; j < d; j++)
14          C[_t/_np][i-_id][j] +=
15            A[_t/_np][i-_id][k]*tmp[j];
16  }

This program is then compiled by a single processor compiler that generates an
executable which can be loaded into every processor of the parallel machine. Instead of
normalizing the i loop, we introduce a new variable, ni, that has the normalized count.


6.3 Compiler/debugger interface

In this section, we describe the information that must be passed from the compiler to the
debugger. We first list the information necessary for distribution in Section 6.3.1 and the
information for thread splitting in Section 6.3.2. We use the matrix multiply examples
for both.

6.3.1  Distribution 

Examples of the information that the compiler must pass to the debugger for the
distribution phase of compilation can be found in FIGURE 6-2. The left hand side is the
information for the cyclic distribution and the right hand side is for the block
distribution. The rest of the section explains what information is needed for each part of the
distribution transformation.

If the compiler uses a cyclic distribution for iterations or data, it strip mines loops or
arrays first. The information that the D function needs is the position of the loop in the
program, and the names of the loop counters for the inner and outer loops. If any
variables have been strip mined, the compiler passes the debugger the name of the variable
and the dimension that has been split.

For iteration distribution, the compiler must pass the debugger the line number of each
distributed loop, the name of the loop counter, the expression for the low bound, and the
expression for the high bound. It must also pass the line numbers of any statements that
it inserted. Neither of our examples has inserted lines.

The D function for communication single steps over code that has been inserted for
communication; the compiler passes these line numbers to the debugger.




FIGURE 6-2  Information passed from the compiler to the debugger for distribution

Cyclic distribution

  loop strip mine
    loop position   inner loop    outer loop
    8               [illegible]   [illegible]

  data strip mine
    variable name   dimension
    [illegible]     [illegible]

  iteration distribution
    loop position   counter       low bound   high bound
    10              [illegible]   0           [illegible]

  inserted communication
    line number
    4

  data distribution
    variable name   T         o   s
    A               [0 1 0]   0   [0 0 0]
    B               [2 0]     0   [0 0]
    C               [0 1 0]   0   [0 0 0]

Block distribution

  iteration distribution
    loop position   counter       low bound     high bound
    8               [illegible]   [illegible]   [illegible]

  inserted communication
    line number
    4

  data distribution
    variable name   T       o   s
    A               [2 0]   0   [illegible]
    B               [2 0]   0   [illegible]
    C               [2 0]   0   [illegible]




The D function for data distribution changes global names to local names for all user
commands to examine and modify variables. If the data is distributed, it uses Equation
5.2 to compute the processor index and Equation 5.4 to convert the global address to a
local address. For non-distributed variables, it examines the copy of the variable on the
processor at the current location. If the user modifies a variable, it modifies all copies.
The compiler must pass to the debugger the names of distributed variables and the
mapping.
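As a rough illustration of the global to local translation, the following sketch handles only a one-dimensional block distribution. Equations 5.2 and 5.4 are not reproduced in this chapter, so the two functions are assumptions chosen to agree with the block listing, where a block size of 2 leads to the index expression k-2*_id. The names block_owner and block_local are hypothetical.

```c
#include <stddef.h>

/* Processor that owns global index g when each processor holds bs
   consecutive elements (assumed form of the processor-index computation). */
size_t block_owner(size_t g, size_t bs)
{
    return g / bs;
}

/* Local index of global index g on processor id. With bs == 2 this is the
   g - 2*_id subtraction that appears in the block-distributed listing. */
size_t block_local(size_t g, size_t bs, size_t id)
{
    return g - bs * id;
}
```

A debugger command that examines global element 5 with a block size of 2 would, under these assumptions, be redirected to local element 1 on processor 2.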

6.3.2 Thread splitting

Examples of the information that the compiler must pass to the debugger for the thread
splitting phase can be found in FIGURE 6-3. All of the information that the compiler
must pass the debugger about thread splitting is concerned with computing the virtual
time and the set of early operations.

In the table, there is one line of information for every statement of the program. The
virtual time template has been described previously. Reads and writes are the sets of variables
that are read or written by each statement of the program. The line also contains the rep and
dist statements contained in each statement of the program.

By plugging in the values for the dist variables and the loop counters, the debugger can
compute the virtual time of a statement. The value of loop variables is taken from the
program state, and the value of the dist variables for a processor index can be
computed from the rep and dist statements and PILs in a program. The virtual time
template can also be used to compute the set of early operations for a state of the program.
With the set of early operations, together with the read and write information for every
statement, the debugger can determine whether a program has early reads or early writes.


6.4 Debugger

This section explains how the D functions for distribution and thread splitting work
together to form an entire debugger. We start by giving an example of how virtual time
is computed in Section 6.4.1. This is followed by a description of how the debugger
commands are implemented assuming a block or cyclic distribution in Section 6.4.2. We
then explain how the basic mechanisms for the thread splitting debugger, roll forward
and consistent breakpoints, are affected by using a block or cyclic distribution in
Section 6.4.3 and Section 6.4.4.

6.4.1 Computing virtual time

Computing the virtual time of a statement is central to the ability to determine the
current location and decide if user commands are disallowed. To compute the virtual time
of a statement, we need two things: the position in the program, which determines the
virtual time template, and the value of the parameters in the template. The parameters
are the values of dist variables and the values of loop counters. The value of dist
variables can be computed statically for each statement from the processor index. The values
of the counters are determined at run-time.




FIGURE 6-3  Information passed from the compiler to the debugger for thread splitting

Block distribution

line    virtual time template     reads            writes   PIL   rep and dist statements
number
 1                                                                rep a = 0 to _np[0]-1 with <a>
 2      (2,k,2,0,0,0,0,0,0)       k,d              k        <0>
 3                                                                rep a = 0 to _np[0]-1 with <a>
 4      (2,k,4,t,4,0,0,0,0)       t,d              t        <0>
 5                                                                rep a = 0 to _np[0]-1 with <a>
 6      (2,k,4,t,6,0,0,0,0)       k,t,tmp,B        tmp,B    <0>
 7                                                                dist b = 0 to _np-1 with <b> {
 8      (2,k,7,b,8,ni,8,0,0)      i,d              i        <0>
 9
10
11      (2,k,7,b,8,ni,11,j,11)    j,d              j        <0>
12      (2,k,7,b,8,ni,11,j,12)    i,j,k,A,tmp,C    C        <0>
13
14                                                                }

Cyclic distribution

line    virtual time template           reads             writes   PIL   rep and dist statements
number
 1                                                                       rep a = 0 to _np[0]-1 with <a>
 2      (2,k,2,0,0,0,0,0,0,0,0)         k,d               k        <0>
 3                                                                       rep a = 0 to _np[0]-1 with <a>
 4      (2,k,4,t,4,0,0,0,0,0,0)         t,d               t        <0>
 5                                                                       rep a = 0 to _np[0]-1 with <a>
 6      (2,k,4,t,6,0,0,0,0,0,0)         k,t,tmp,B         tmp,B    <0>
 7                                                                       rep a = 0 to _np[0]-1 with <a>
 8      (2,k,8,_t,8,0,0,0,0,0,0)        _t,d              _t       <0>
 9                                                                       dist b = 0 to _np-1 with <b> {
10      (2,k,8,_t,9,b,10,ni,10,0,0)     i,d               i        <0>
11
12
13      (2,k,8,_t,9,b,10,ni,13,j,13)    j,d               j        <0>
14      (2,k,8,_t,9,b,10,ni,13,j,14)    C,_t,i,j,A,tmp    C        <0>
15
16                                                                       }



Each statement can potentially have its own mapping between processor indexes and
values of dist variables. When the debugger is invoked, it computes this mapping by
expanding the rep and dist statements of a program. For every statement with a PIL, we record
the statement number from the P program, the value of the PIL, and the value of the
dist variable. As an example, we compute the mapping for the statements in the block
distributed program. Assume that there are only 2 processors; _np is equal to 2. We
only need to know the value of dist variables, so we can ignore everything before line
7. Furthermore, the PIL for every statement is the same, so the mapping is the same for
all statements. The mapping is as follows:

dist variable value    processor index
0                      0
1                      1

The mapping for the cyclically distributed program is the same.

After a processor is stopped because of an event, its virtual time is computed as follows.
From the line number, we get a virtual time template. If the template has dist variables
in it, their values are obtained from the table that maps processor indexes to dist
variables. If the template has any loop variables in it, their values are obtained from the loop
variables in the program.

As an example of computing the virtual time, if processor 1 stops at statement 13 in the
cyclically distributed program, the virtual time template is
(2,k,8,_t,9,b,10,ni,13,j,13). If the values of the loop counters are k=3, _t=1, ni=1,
and j=2, then the current virtual time on processor 1 is (2,3,8,1,9,1,10,1,13,2,13).
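The substitution can be sketched as follows. The encoding of a template as an array of literal and variable elements, and every identifier in the block (vt_elem, vt_instantiate, the LIT and VAR macros), is a hypothetical choice made for illustration; the text does not say how the debugger stores templates.

```c
#include <stddef.h>

/* Variables that may appear in a template (hypothetical encoding). */
enum vt_var { VT_K, VT_T, VT_B, VT_NI, VT_J, VT_NVARS };

/* One template element: a literal number or a reference to a variable. */
struct vt_elem {
    int is_var;   /* 0: literal, 1: variable */
    int value;    /* the literal, or an enum vt_var */
};

#define LIT(n) { 0, (n) }
#define VAR(v) { 1, (v) }

/* Template of statement 13 in the cyclic program:
   (2,k,8,_t,9,b,10,ni,13,j,13). */
static const struct vt_elem stmt13_template[] = {
    LIT(2), VAR(VT_K), LIT(8), VAR(VT_T), LIT(9), VAR(VT_B),
    LIT(10), VAR(VT_NI), LIT(13), VAR(VT_J), LIT(13)
};

/* Fill out[0..len-1] with the virtual time obtained by replacing each
   variable element with its current value from vars. */
void vt_instantiate(const struct vt_elem *tmpl, size_t len,
                    const int *vars, int *out)
{
    for (size_t i = 0; i < len; i++)
        out[i] = tmpl[i].is_var ? vars[tmpl[i].value] : tmpl[i].value;
}
```

For processor 1 the dist variable b is 1 (from the mapping table), so with k=3, _t=1, ni=1, and j=2 the sketch produces the same tuple as the worked example.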

6.4.2 Selecting the P time, setting breakpoints, examining and modifying
variables, and reporting the current location

Before the debugger can decide if commands should be disallowed, it must compute the
early operations. It must first choose the P time, which is computed from the virtual
time of each processor. The P time, along with the virtual time template, is used to
determine if each statement has been executed early. For the cyclically distributed program,
if the virtual time of processor 0 is (2,3,8,1,9,0,10,1,13,5,13), and the virtual time of
processor 1 is (2,3,8,1,9,1,10,1,13,2,13), then the P time is the minimum, or
(2,3,8,1,9,0,10,1,13,5,13). For cyclic distributions in general, the P time is the time of
the processor that has executed the least number of outer loop iterations. For the block
distribution, it is the first processor that has not executed all of its iterations.
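Taking the minimum of the per-processor virtual times can be sketched as below. The sketch assumes that virtual times of equal length are ordered lexicographically, which is consistent with the example (the two times first differ in the sixth component, 0 versus 1) but is not stated explicitly in this section. The names vt_compare and vt_earliest are hypothetical.

```c
#include <stddef.h>

/* Lexicographic comparison of two virtual times of equal length.
   Returns -1, 0, or 1 when a is earlier than, equal to, or later than b. */
int vt_compare(const int *a, const int *b, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (a[i] < b[i]) return -1;
        if (a[i] > b[i]) return 1;
    }
    return 0;
}

/* times holds nproc virtual times of len components each, stored one
   after another. Returns the index of the processor with the earliest time. */
size_t vt_earliest(const int *times, size_t nproc, size_t len)
{
    size_t best = 0;
    for (size_t p = 1; p < nproc; p++)
        if (vt_compare(times + p * len, times + best * len, len) < 0)
            best = p;
    return best;
}
```

Applied to the two virtual times in the example, the sketch selects processor 0, whose time is the one chosen in the text.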

After computing the virtual time sets for each statement and processor, we conclude that
statements 13 and 14 have early executions on processor 1. The rest of the section
assumes that the program is in the state described above.

If the user wants to set a breakpoint on line 13, the debugger for distribution sets a
breakpoint on every copy of that statement; there is one copy for each processor. The
debugger for thread splitting must check if any of those statements are early. For our
example, line 13 on processor 1 has an early execution, so the debugger must disallow
this breakpoint. If the user sets a breakpoint on line 2, then the debugger for thread
splitting would set a breakpoint on both copies. The thread splitting debugger determines
that there are no early executions for any of the copies of line 2 and sets the breakpoints.

If the user examines the variable d, then the debugger for distribution inspects the copy
of d that is local to the processor executing the current statement. This is processor 0.
On processor 0, there are no early operations, thus there cannot be an early write of the
variable d, so the thread splitting debugger does the read of the variable d on processor
0. If the user tries to modify the variable d, then the debugger for distribution modifies
all copies of the variable. The debugger for thread splitting checks if there are any early
reads or writes for the variable d for all copies. Processor 1 has early execution of
statements 13 and 14, and from checking the table in FIGURE 6-3, it concludes that there is
an early read of the variable d. It must disallow the modification of the variable d
because one copy can't be modified.

If the user does a where command, the debugger for thread splitting passes back the
current statement, which contains the statement number and processor index. The
distribution debugger just peels off the processor index and returns the statement number. In
the current state, the thread splitting debugger passes back line 13 on processor 0 as the
current line and the distribution debugger passes back line 13 as the current statement.
This is reported to the user as the current line.

6.4.3 Roll forward

Roll forward is used for three purposes. It is used to advance execution of a program so
that there are no late operations, to rerun the program to a particular P time, and to run a
program with a consistent breakpoint. The virtual time template and current virtual time
are used to find the LUB for each processor, which is the point in the program at which
it should stop.

If the desired point in the block distributed program is k=3, ni=1, and j=2 on
statement 11 of processor 0, then if we substitute the values into the virtual time template of
(2,k,7,b,8,ni,11,j,11) we have a virtual time of (2,3,7,0,8,1,11,2,11). This is also the
LUB of processor 0. For processor 1, the LUB is at the beginning of its copy of the i
loop. This is at statement 11, with a virtual time of (2,3,7,1,8,0,11,0,11).

We use FIGURE 6-4 to illustrate the general conditions for finding the LUB for each
processor. We use a block distribution with a block size of 3 and a cyclic distribution
where each processor gets 3 iterations. In the block distribution, for processors that
execute iterations earlier than the target time, the LUB is the first thing after the loop; the
processor executes its entire block. For processors that execute iterations after the target
time, the LUB is the first iteration that it executes; the beginning of the block. For the
cyclic distribution in the figure, o is the outer loop counter and i is the inner loop
counter (the inner and outer loop are the result of the strip mine). The LUB for a
processor is either the same or next iteration of the outer loop, depending on whether the
processor comes before or after the processor executing the target time.
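The rules above can be written down as a small sketch. All names (lub_kind, block_lub, cyclic_lub_outer) are hypothetical. For the cyclic case the text only says "same or next" outer iteration; the sketch assumes that processors before the target processor must finish their iteration of the current outer iteration and therefore stop at the next one, while processors after it stop at the current one.

```c
/* Where a processor of the block distributed loop should stop. */
enum lub_kind { LUB_AFTER_LOOP, LUB_AT_TARGET, LUB_BLOCK_START };

/* q is the processor in question, target_q the processor whose block
   contains the target time. */
enum lub_kind block_lub(int q, int target_q)
{
    if (q < target_q) return LUB_AFTER_LOOP;   /* runs its entire block */
    if (q > target_q) return LUB_BLOCK_START;  /* runs no iterations */
    return LUB_AT_TARGET;
}

/* Outer loop iteration at which processor q of the cyclic loop stops,
   when the target lies in outer iteration o on processor target_q
   (assumed reading of "same or next iteration"). */
int cyclic_lub_outer(int o, int q, int target_q)
{
    return (q < target_q) ? o + 1 : o;
}
```

This agrees with the block example above, where the target is on processor 0 and the LUB of processor 1 is the beginning of its copy of the loop.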

FIGURE 6-4  The position of the LUB for a given target time with the block and cyclic distribution
[Figure: two panels, one per distribution, each indexed by processor index.]

6.4.4 Consistent breakpoints

If the user sets a consistent breakpoint in a program, then we must compute the LUB for
that breakpoint. FIGURE 6-5 illustrates where the LUB is for each processor before we
start execution of the loop. For the block distributed program, the LUB for each
processor is the beginning of the block. If we reach a LUB for a processor without hitting a
breakpoint first, then the new LUB for that processor is the end of the loop.

For a cyclic program, the LUB is always the next iteration that the processor executes.
After we pass a LUB for a processor, the new LUB for that processor is the next
iteration it executes.


6.5  Performance  of  block  and  cyclic  distributions 
when  debugging 


We conclude this chapter by investigating some performance related issues of
debugging for the block and cyclic distributions. The actual performance that a user sees is
dependent upon the particular program, its communication, where the user decides to
stop the program, and what variables are inspected or modified. However, it is possible
to make some simple statements about the relative performance of the loop distribution
models. For the following analysis, we assume that all processors complete loop bodies
at the same rate. This is true if there is no synchronization in the loop body, and the
work in each iteration is data independent.

FIGURE 6-5  The position of the LUB for consistent breakpoints with block and cyclic distributions
[Figure: two panels, Block distribution and Cyclic distribution; the horizontal axis is the processor index (0, 1, 2) and the vertical axis is time; iterations are labeled by the counters o and i, and LUB positions are marked.]

6.5.1 Early operations

The more early operations there are, the more likely it is that we would have to disallow
a command. The block and cyclic distributions can be expected to have different
behavior in terms of the number of early operations.

We count the number of early loop iterations to compare which distribution has more
early operations. For both distributions, the P time is the earliest virtual time of all the
processors. Under the equal progress assumption, for a block distributed loop the
earliest virtual time is always on the first processor. This is because all of the iterations on
the first processor have virtual times less than the iterations of all the other processors.




Hence, all the iterations executed by all the processors except for the first one are early.
This is $i(p-1)$, where i is the number of iterations executed on one processor and p
is the number of processors. For the cyclic distribution, iterations are assigned in round
robin fashion. While iteration i is in progress on processor 0, iterations i+1 through
i+p-1 are in progress on the other processors. All the iterations in progress on all but
the first processor are early, but all other iterations already completed are not early. The
number of early operations for the cyclic distribution is at most p-1. In comparison, we
would expect a loop that is block distributed to have many more early operations; hence
it is more likely that commands would be disallowed when using this distribution.
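The two counts can be checked by brute force under the equal progress assumption. The sketch below is an illustration with hypothetical names (early_block, early_cyclic). It assumes every processor has started i iterations (with 1 <= i <= b for a block size b), takes the iteration currently in progress on processor 0 as the reference point, and counts every started iteration with a larger global index as early.

```c
/* Early iterations for a block distribution: processor q owns global
   iterations q*b .. q*b + b - 1 and has started the first i of them. */
long early_block(long p, long b, long i)
{
    long ref = i - 1;                      /* in progress on processor 0 */
    long early = 0;
    for (long q = 0; q < p; q++)
        for (long m = 0; m < i; m++)
            if (q * b + m > ref)
                early++;
    return early;                          /* equals i * (p - 1) */
}

/* Early iterations for a cyclic distribution: processor q owns global
   iterations q, q + p, q + 2p, ... and has started the first i of them. */
long early_cyclic(long p, long i)
{
    long ref = (i - 1) * p;                /* in progress on processor 0 */
    long early = 0;
    for (long q = 0; q < p; q++)
        for (long m = 0; m < i; m++)
            if (m * p + q > ref)
                early++;
    return early;                          /* equals p - 1 */
}
```

With 4 processors, a block size of 10, and 3 iterations started per processor, the block count is 9 while the cyclic count is 3.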

6.5.2 Parallelism during roll-forward

When we want to stop the program at a particular virtual time, for example when we are
rolling forward a set of processors, we would like to execute the program with the
maximum parallelism available so that we may reach that point as quickly as possible. If
we have a block distribution with block size b and we want to stop at virtual time i, then we can allow all
processors with an index less than $\lfloor i/b \rfloor$ to execute their complete loops, while
processors with an index greater than $\lceil i/b \rceil$ cannot execute any iterations. The degree of
parallelism available is thus determined by the desired loop iteration; the higher the loop
iteration the more parallelism that can be exploited.

For the cyclic distribution, since iterations are assigned to processors in a round robin
fashion, we can allow all of the processors to execute the first $\lfloor i/p \rfloor$ iterations that they
are assigned. The remainder, which is $i - \lfloor i/p \rfloor p$, are executed by processors 0 through
$i - \lfloor i/p \rfloor p - 1$.

Thus, a cyclic distribution will reach the target point faster because it can exploit more
parallelism. If we want to execute the first i iterations of the source loop, then the block
distribution will take $\min(n/p, i)$ steps, where n is the total number of iterations to be
executed. The cyclic distribution will take $\lfloor i/p \rfloor$ steps to complete the first $\lfloor i/p \rfloor p$
iterations and 1 more step to complete the remainder, which is $i - p\lfloor i/p \rfloor$. The total is thus
$\lfloor i/p \rfloor + 1$.
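A minimal sketch of the two step counts, using the formulas as stated in the text: min(n/p, i) for the block distribution and floor(i/p) + 1 for the cyclic distribution. The function names are hypothetical, integer division stands in for the floor, and the extra cyclic step is counted even when the remainder happens to be zero, as in the text.

```c
/* Steps needed to execute the first i source iterations of an n-iteration
   loop on p processors with a block distribution: min(n/p, i). */
long block_steps(long n, long p, long i)
{
    long per_proc = n / p;                 /* iterations in one block */
    return per_proc < i ? per_proc : i;
}

/* Steps needed with a cyclic distribution: floor(i/p) full rounds plus
   one more step for the remainder. */
long cyclic_steps(long p, long i)
{
    return i / p + 1;
}
```

For n = 100, p = 4, and i = 10, the block distribution needs 10 steps while the cyclic distribution needs 3.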


6.5.3 Synchronization overhead of consistent breakpoints

When we execute a block or cyclic distributed loop with a consistent breakpoint set, we cannot
take advantage of any parallelism, because the breakpoint may stop at any iteration. If
we want to stop in a state without any early iterations, we can only execute one iteration
at a time.

The execution of a loop with a consistent breakpoint should be slower than the
sequential version of the loop because extra synchronization must be done. Synchronization
occurs whenever a program reaches its LUB; a breakpoint occurs and the processor
waits for the debugger to let it go forward. The debugger lets it go forward when all the
iterations with lower virtual times complete. As described in Section 6.4.4, each
processor of the block distributed program must synchronize before executing any of its block
of iterations. Once it has passed its LUB it does not need to synchronize again until the
end of the loop. The cyclic distribution must synchronize before every iteration.

The block distribution can be expected to synchronize less than the cyclic distribution;
hence consistent breakpoints will be better for the block distribution. For the block
distribution, if the real breakpoint occurs in iteration i, then the number of synchronization
points is $\lfloor i/b \rfloor$. For a cyclic distribution, the number of synchronization points is i.


6.6 Summary

We have presented examples of compilation and debugger functions for programs with
block and cyclic distributions. We also examined some performance issues related to
debugging. The cyclic distribution can be expected to have fewer early operations and
hence is less likely to disallow a command. It is also able to exploit more parallelism
when rolling forward to a particular virtual time. However, the block distribution can be
expected to have less overhead when executing with a consistent breakpoint.




CHAPTER  7 


Limitations  of  thread 
splitting 


In this chapter, we identify the types of parallelizing transformations for which it is
possible to construct a debugger using the thread splitting methodology. The main
characteristic of a transformation that determines if it is suitable is the type of assignment it
uses to map operations in the source program to processors. An assignment of a
sequential program is specified by giving each operation a processor index label (PIL). The PIL
is the processor that executes the operation in the parallel program. We call an
assignment of operations to processors data independent when the PILs are constant. We call
an assignment of operations to processors with non-constant PILs a data dependent
assignment. Thread splitting only allows data independent assignments because PILs
must be constant.

A data independent assignment is desirable because it makes it possible for the
debugger to take advantage of static program information, which is important for efficiency.
However, most parallelizing transformations use data dependent assignments; a
practical compiler and debugger must be able to support both.

If a transformation uses a data dependent assignment, source programs for the
transformation can be rewritten so that a data independent assignment can be used to achieve
the desired mapping of operations to processors. Thus, a debugger need only allow data
independent assignments. However, a program generated in this way may be less
efficient than a program generated from a data dependent assignment. For transformations
where the assignment is data dependent but known at compile-time, the loss in
efficiency is small. This includes the block, cyclic, and block-cyclic distributions. For
transformations where the distribution of operations to processors is dynamic, the loss in
efficiency can be significant. This includes user mapped and dynamically load balanced
distributions.




In Section 7.1, we define data independent and data dependent assignments and describe
their relationship to thread splitting and debuggers. Section 7.2 explains why a program
generated with a data independent assignment can be less efficient than a program
generated with a data dependent assignment. Section 7.3 shows how to rewrite a program
so that a data independent assignment can be used. Section 7.4 describes how to bound
the overhead for an assignment of a transformation and analyzes the commonly used
distributions.

7.1  Data  dependent  and  Independent  assignments 


Generating a parallel program from the sequential one can be modeled as an assignment
of operations to processors. The assignment can be thought of as the specification of the
distribution. Alternatively, a distribution can be thought of as an implementation of an
assignment.

An assignment is data independent if, for all executions, an operation is always
executed on every member of the same set of processors, even if the operation is executed
more than once. An operation may be executed more than once if it is inside a loop. A
data independent assignment is shown in FIGURE 7-1. All iterations of the first loop are
always executed on processor 0, all iterations of the second loop are always executed on
processor 1, and all iterations of the last loop are always executed on processor 2.

FIGURE 7-1  Examples of data dependent and data independent schedules

data independent assignment

<0>   for (i = 0; i < n/3; i++)
<0>     b = 1;
<1>   for (i = n/3; i < 2*n/3; i++)
<1>     b = 1;
<2>   for (i = 2*n/3; i < n; i++)
<2>     b = 1;

data dependent assignment

<i/3> for (i = 0; i < n; i++)
<i/3>   b = 1;

An assignment is data dependent if the function that assigns operations to processors is
dependent on data in the program. An example can be found in FIGURE 7-1. The result
of dividing the loop counter by 3 is the index of the processor that executes the iteration.
In general, any distribution of loop iterations (e.g. block and cyclic) requires a data
dependent assignment. Thread splitting cannot be used directly when a data dependent
assignment is desired because only constant PILs are allowed.

A data independent assignment is desirable because it makes it possible for the debugger
to take advantage of static information about the assignment of operations to processors
in a program. Static information is useful for mechanisms related to virtual time
such as computing early operations, rolling forward a program, and consistent
breakpoints. If the debugger does not take advantage of static information, then it would not
be practical to provide some of these mechanisms.

A simple example of how a debugger can take advantage of static information can be
found in FIGURE 7-1. If we use a data independent assignment as is shown in the top of
the figure, and the current statement is selected to be the third statement (the for loop),
then it is clear that processor 0 should have executed all of its iterations and processor 2
should not have executed any iterations if we want the current state to be consistent with
a state of the β program. By contrast, if we have the data dependent assignment found in
the bottom of the figure and the current line is chosen to be the second line of that
program, then the debugger must know something about the behavior of the function in the
PILs and the variables they reference to determine which iterations each processor
should execute.


7.2 Why programs with data independent assignments
can be less efficient


The execution overhead introduced by rewriting a program so that a data independent
assignment can be used is inherent in how much is known at compile-time about the
assignment. If the assignment is known at compile-time, then the overhead is low. If
there are points in execution of the program where the assignment can only be resolved
at run-time, there is some overhead. The total overhead depends on how often the
assignment must be resolved at run-time.

To illustrate the differences between data dependent and data independent assignments,
we describe an example where the assignment cannot be known at compile-time. We
describe how to write the program so that a data dependent assignment can be used, then
show how a data dependent version can be more efficient.

If we are allowed to use a data dependent assignment, we can simply give each operation
in the β program a PIL that computes the processor which should execute it. In the
ω program, which is generated from the β program, only the processor that is assigned
the operation executes it.

An example of a data dependent assignment is found in FIGURE 7-2. In this program, a
user-defined map is used to determine which processor executes each iteration; iteration
i of the loop is executed on processor map[i]. Because the array map is set at run-time,
any processor can execute any iteration. Assume that at run-time, before executing the
loop, the program computes the set of iterations that each processor must execute. The
result is stored in two arrays: its and umap. For processor p, its[p] is the number
of iterations executed by the processor and {umap[p][0], umap[p][1], ...,
umap[p][its[p]-1]} is the set of iterations that processor p executes. In the ω program,
each processor just executes its assigned set of iterations according to the its
and umap arrays.
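
The run-time precomputation assumed in this paragraph can be sketched in C. This is a minimal sketch and not code from the thesis: only the array names map, its, and umap come from the text, while the function name build_schedule, the capacities MAX_PROCS and MAX_ITERS, and the argument order are hypothetical choices made for illustration.

```c
#define MAX_PROCS 8    /* assumed capacity, for illustration only */
#define MAX_ITERS 64   /* assumed capacity, for illustration only */

/*
 * Invert map[] (iteration -> processor) into per-processor iteration lists.
 * Afterwards its[p] is the number of iterations owned by processor p and
 * umap[p][0 .. its[p]-1] lists those iterations in increasing order.
 * Assumes n <= MAX_ITERS, nprocs <= MAX_PROCS, and 0 <= map[i] < nprocs.
 */
void build_schedule(const int *map, int n, int nprocs,
                    int its[MAX_PROCS], int umap[MAX_PROCS][MAX_ITERS])
{
    for (int p = 0; p < nprocs; p++)
        its[p] = 0;
    for (int i = 0; i < n; i++) {
        int p = map[i];          /* processor that owns iteration i */
        umap[p][its[p]] = i;
        its[p]++;
    }
}
```

With these arrays in place, processor _id only needs to loop over t from 0 to its[_id]-1 and read its next iteration from umap[_id][t].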

The PIL for each statement uses the map array to specify which processor executes the
iteration. Some statements have a PIL of map[i] and others have a PIL of map[i+1]
because the variable i is incremented in the middle of the loop. In the ω program, each
processor only executes its assigned iterations and does not execute the loop control for
any other iterations.

The β program that must be used for a data independent assignment is very different for
this example. In general, if there is a point m in the control flow of the β program where
any one of a set of processors s can be assigned the same computation c, then for every
processor p in s, there must be one execution path starting at m that has a copy of the
computation c with a PIL of p. A conditional in the program decides at run-time which
path to execute, in effect deciding which processor should execute the computation c. In
the ω program, which is extracted from the β program, every processor in s must
execute the conditional. The processor that is assigned the current instance of c executes the
path which contains c as well.

This is illustrated by the program with a data independent assignment in FIGURE 7-2.
As before, a user-defined map is used to determine which processor executes each
iteration. Because any processor can execute any iteration, there must be one execution path
through the loop body of the β program for each processor. In the ω program in the
figure, every processor executes the loop control once for every iteration of the β program
and a conditional decides if the processor executes that particular iteration. By
comparison, a program where a data dependent assignment can be used is much more efficient
because each processor only executes the loop control for its assigned iterations.
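
The difference in loop control can be made concrete by counting it for one processor. The following is a minimal sketch under stated assumptions: the type Work and the two function names are hypothetical, and the scan over map[] in the second function merely stands in for the precomputed its and umap arrays, so it is not charged as loop control.

```c
/* Work done by one processor: loop-control executions and loop bodies. */
typedef struct {
    int control;
    int body;
} Work;

/* Data independent version: every processor steps through every iteration
 * and a guard decides whether this processor executes the body. */
Work run_data_independent(const int *map, int n, int id)
{
    Work w = {0, 0};
    for (int i = 0; i < n; i++) {
        w.control++;             /* loop control and guard, paid by everyone */
        if (map[i] == id)
            w.body++;
    }
    return w;
}

/* Data dependent version: the processor visits only its own iterations.
 * The scan over map[] stands in for the precomputed its/umap arrays and
 * is deliberately not counted as loop control. */
Work run_data_dependent(const int *map, int n, int id)
{
    Work w = {0, 0};
    for (int i = 0; i < n; i++) {
        if (map[i] == id) {
            w.control++;         /* one loop-control step per owned iteration */
            w.body++;
        }
    }
    return w;
}
```

For either version, control minus body is the number of loop-control executions that did no useful work on that processor; it is zero in the data dependent case and n minus the number of owned iterations in the data independent case.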

Overhead is any work present in a program with a data independent assignment that
need not be present in a version of the program with a data dependent assignment. For a
particular data independent assignment, the overhead the program incurs is dependent
on the way the β program is written. If we know at compile-time how work is assigned
to processors, then we can change the structure of the program so that we can reduce the
number of processors that can potentially execute the next operation. By doing this, we
reduce the overhead of the program. If the assignment is known completely at
compile-time, then there is no overhead because we can construct a program where only one
processor executes the next iteration.

FIGURE 7-2 Examples of data dependent and data independent schedules

data dependent assignment

β program

    <0>,<1>,<2>   i = -1;
    <map[i+1]>    for (t = 0; t < its[_id]; t++) {
    <map[i+1]>        i = umap[_id][t];
    <map[i]>          b = 1;
                  }

ω program

    i = -1;
    for (t = 0; t < its[_id]; t++) {
        i = umap[_id][t];
        b = 1;
    }

data independent assignment

β program

    <0>,<1>,<2>   for (i = 0; i < 2; i++) {
    <0>               if (map[i] == _id)
    <0>                   b = 1;
    <1>               if (map[i] == _id)
    <1>                   b = 1;
    <2>               if (map[i] == _id)
    <2>                   b = 1;
                  }

ω program

    for (i = 0; i < 2; i++) {
        if (map[i] == _id)
            b = 1;
    }

The example in FIGURE 7-3 illustrates the way we can exploit compile-time knowledge
of the assignment to reduce overhead. The program on top is the α program; below
are two possible β and ω programs for the same assignment. In the assignment,
iteration i of the α loop is executed on processor i/3. We assume that there are 3 processors
in the example. In the first β program, there is one execution path through the loop body
for each possible assignment of an iteration of the α loop. That is, if an iteration is
assigned to processor 0, the body of the first if is executed, if an iteration is assigned to
processor 1, the body of the second if is executed, and so on. In the ω program, every
processor executes the loop control for every iteration as well as its assigned iterations.
In the second β program, we have created three copies of the α loop. Because there are
multiple copies of the loop, at every point in the execution of the β program, only one
processor can execute the next iteration of the α program. In the ω program every
processor only executes its assigned iterations.
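
The second β program of FIGURE 7-3 relies on the loop bounds _id*n/3 and (_id+1)*n/3 splitting the iterations so that each one has exactly one owner. A minimal sketch of that check follows. It is an illustration only: the function names block_bounds and owners_of are hypothetical, the processor count is generalized from 3 to a parameter nprocs, and the bound written (_id+1)n/3 in the figure is read as a multiplication.

```c
/* Half-open iteration range [lo, hi) owned by processor id, using the same
 * integer arithmetic as the loop bounds id*n/P and (id+1)*n/P. */
void block_bounds(int id, int n, int nprocs, int *lo, int *hi)
{
    *lo = id * n / nprocs;
    *hi = (id + 1) * n / nprocs;
}

/* Count how many processors execute iteration i; 1 means exactly one owner. */
int owners_of(int i, int n, int nprocs)
{
    int count = 0;
    for (int id = 0; id < nprocs; id++) {
        int lo, hi;
        block_bounds(id, n, nprocs, &lo, &hi);
        if (lo <= i && i < hi)
            count++;
    }
    return count;
}
```

Because the upper bound of one processor is computed by the same expression as the lower bound of the next, the ranges neither overlap nor leave gaps, even when n is not a multiple of the number of processors.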

FIGURE 7-3 Two possible ways of writing a β program for the same assignment

α program

    for (i = 0; i < n; i++)
        b = 1;

β program

    <0>,<1>,<2>   for (i = 0; i < n; i++) {
    <0>               if (_id == i/3)
    <0>                   b = 1;
    <1>               if (_id == i/3)
    <1>                   b = 1;
    <2>               if (_id == i/3)
    <2>                   b = 1;
                  }

ω program

    for (i = 0; i < n; i++) {
        if (_id == i/3)
            b = 1;
    }

β program

    <0>   for (i = _id*n/3; i < (_id+1)*n/3; i++)
    <0>       b = 1;
    <1>   for (i = _id*n/3; i < (_id+1)*n/3; i++)
    <1>       b = 1;
    <2>   for (i = _id*n/3; i < (_id+1)*n/3; i++)
    <2>       b = 1;

ω program

    for (i = _id*n/3; i < (_id+1)*n/3; i++)
        b = 1;


7.3 Generating a β program

In this section, we describe how to structure a β program to minimize overhead when a
data independent assignment is used. We want to be able to describe the assignment of a
transformation independent of the particular program to which it is applied. For this
reason, we use a simple model of a program as a single loop with a single statement body.
The iterations of the loop are mapped to a 1 dimensional processor array.

For distribution, the compiler inputs an α program, and uses a combination of
conditionals, loops, and sequences of code to produce a β program with a data independent
assignment such that each processor executes the appropriate iteration of the loop from
the α program. Each construct is useful for different patterns in the assignment.


If the assignment follows some finite sequence that is fixed at compile-time, then we can
create multiple copies of the loop body of the α program as a sequence and assign each
copy to the appropriate processor. An example is shown in part (a) of FIGURE 7-4. The
pattern is (0,2,0,1,1,2) so we create 6 copies of the loop body and give them PILs from
the sequence. If the assignment repetitively assigns an iteration to the same processor,
then we can put the body of the α program in a loop and give it the PIL of the processor.
This is illustrated in part (b) of FIGURE 7-4 where a data dependent number of
iterations are assigned to processor 0. If at some point in the program, the assignment of an
iteration can only be determined at run-time, then a conditional can be used to select
which processor executes the next iteration. In part (c) of FIGURE 7-4, iterations can
either be mapped to processor 0 or processor 1. The conditional selects which processor
executes the iteration.

Each of the above constructs can be combined. For example, if the assignment
repeatedly assigns iterations to the same sequence of processors, then we replicate the body
according to the sequence and place that inside a loop.


7.4  Estimating  the  overhead  of  data  independent 
assignments 


The overhead of using a data independent assignment depends on the parallelizing
transformation, the program, and the data the program is executed on. To determine if a
parallelizing transformation is suitable for thread splitting, we would like to bound the
extra work for all programs generated by a parallelizing transformation.

It is not important how an assignment decides how to map iterations to processors; our
main concern is the set of possible assignments for all programs. For this reason, we
describe the result of assignment without specifying how it is done. A schedule of a
loop represents the assignment of iterations to processors for a single execution of the
program. A program can have different schedules if the program is executed on
different data. The schedule can be represented by a string of numbers p_1, p_2, ..., p_n where
iteration i is executed on processor p_i. For the estimation of overhead, we assume that
all programs terminate, thus schedules are finite.
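
As an illustration of this representation, the sketch below writes the schedule of a loop as a string of digits, one character per iteration. It is a sketch under stated assumptions rather than anything taken from the thesis: the owner rule (i / block) % nprocs is one common way to express block, cyclic, and block-cyclic distributions, the names owner and make_schedule are hypothetical, and processor indexes are limited to single digits.

```c
/* Processor that executes iteration i when blocks of `block` consecutive
 * iterations are dealt out to nprocs processors in round-robin order.
 * block == 1 gives a cyclic distribution; a block large enough that the
 * pattern never wraps gives a block distribution. */
int owner(int i, int block, int nprocs)
{
    return (i / block) % nprocs;
}

/* Write the schedule for n iterations into out as a string of digits.
 * Assumes nprocs <= 10 so each processor index is one character, and that
 * out has room for n + 1 characters. */
void make_schedule(int n, int block, int nprocs, char *out)
{
    for (int i = 0; i < n; i++)
        out[i] = (char)('0' + owner(i, block, nprocs));
    out[n] = '\0';
}
```

For three processors and twelve iterations, a block size of 4 yields 000011112222 and a block size of 2 yields 001122001122; a block size of 1 over nine iterations yields 012012012.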

For each parallelizing transformation, there is a set of schedules possible for all
executions of the single loop program. The set of schedules can be specified with a regular
expression, where every schedule possible for that transformation is contained in the set
of strings defined by the regular expression.

To denote a regular expression, we use numbers, which represent processor indexes,
parentheses, and the operators | and *. The operator | is used for selection of alternates.
If there is a regular expression (a | b), then the set of strings in the language is a and b.
The operator * (also called closure) is used for repetition 0 or more times. The
expression a(b*) has the following strings in its language: a, ab, abb, abbb, ....

FIGURE 7-4 Converting loops with data dependent PILs into loops with data independent PILs

(a) Assignment is a sequence of processor indexes
    assignment = (0,2,0,1,1,2)

    for (i = 0; i < 6; i++)
        S_i;

    <0>   i = 0; S_i;
    <2>   i = 1; S_i;
    <0>   i = 2; S_i;
    <1>   i = 3; S_i;
    <1>   i = 4; S_i;
    <2>   i = 5; S_i;

(b) Assignment is a repetitive sequence of processor indexes
    assignment = 1(0)*

    for (i = 0; i < n; i++)
        S_i;

    <1>   S_0;
    <0>   for (i = 1; i < n; i++)
    <0>       S_i;

(c) Assignment is conditional
    assignment = (0|1)*

    for (i = 0; i < n; i++)
        S_i;

    <0>,<1>   for (i = 0; i < n; i++)
    <0>,<1>       if (condition)
    <0>               S_i;
                  else
    <1>               S_i;

A regular expression was chosen to represent the set of schedules for several reasons.
First and most important, there is a straightforward method for placing an upper bound
on the overhead of a parallelizing transformation given the regular expression defining
the set of schedules. This is explained in detail in Section 7.4.2.

Second, a regular expression is always sufficient to describe a superset of the set of
schedules for a program. While we may not always be able to write a regular expression
that includes exactly the set of schedules possible for a parallelizing transformation, we
can always find one that includes every possible schedule. That is, there may be
additional strings in the language that are not schedules of the program. For example, the set
of schedules for a program where statements are only mapped to processors 0 and 1 is
always contained in (0|1)* so this regular expression can be used to describe any set of
schedules.

Third, the regular expression is simple to write. The regular expression defining the set
of schedules captures the information necessary to measure the overhead, but leaves out
most of the details that are irrelevant. Writing the regular expression is much easier than
writing the β program. In Section 7.4.1, we list all the regular expressions for the
common distributions. They are short and have simple structures.

Fourth, the structure of the regular expression suggests a structure for implementing an
efficient β program. To generate a β program, every number can be replaced with a copy
of the body with the number as a PIL, every |-clause can be replaced with a conditional,
and every closure can be replaced with a loop. The regular expression does not contain
enough information to generate a complete β program because it is missing the
conditions for assignment to a particular processor. That is, if we had (0|1) we know that the
iteration is assigned to processor 0 or 1, but we do not know when it is assigned to each
processor.
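
The replacement rules in this paragraph can be sketched as a small recursive printer. This is a hypothetical illustration, not the compiler described in the thesis: the Node type, the function names, and the output layout are all assumptions, the conditions are left as placeholder comments because the regular expression does not supply them, and PILs are printed only on the copies of the loop body S, not on the control statements.

```c
#include <stdio.h>
#include <string.h>

typedef enum { PROC, SEQ, ALT, STAR } Kind;

typedef struct Node {
    Kind kind;
    int proc;                  /* PROC: processor index, used as the PIL */
    const struct Node *left;   /* SEQ, ALT: first operand; STAR: body    */
    const struct Node *right;  /* SEQ, ALT: second operand               */
} Node;

/* Append one indented line to out, truncating if the buffer is full.
 * out must already hold a NUL-terminated string. */
static void put_line(char *out, size_t cap, int depth, const char *text)
{
    size_t used = strlen(out);
    if (used < cap)
        snprintf(out + used, cap - used, "%*s%s\n", depth * 4, "", text);
}

/* Append a skeleton for the regular expression re to out: a number becomes
 * a copy of the body with that PIL, a |-clause becomes a conditional, and
 * a closure becomes a loop. */
void emit_skeleton(const Node *re, int depth, char *out, size_t cap)
{
    char stmt[32];
    switch (re->kind) {
    case PROC:
        snprintf(stmt, sizeof stmt, "<%d> S;", re->proc);
        put_line(out, cap, depth, stmt);
        break;
    case SEQ:
        emit_skeleton(re->left, depth, out, cap);
        emit_skeleton(re->right, depth, out, cap);
        break;
    case ALT:
        put_line(out, cap, depth, "if (/* condition */) {");
        emit_skeleton(re->left, depth + 1, out, cap);
        put_line(out, cap, depth, "} else {");
        emit_skeleton(re->right, depth + 1, out, cap);
        put_line(out, cap, depth, "}");
        break;
    case STAR:
        put_line(out, cap, depth, "while (/* more iterations */) {");
        emit_skeleton(re->left, depth + 1, out, cap);
        put_line(out, cap, depth, "}");
        break;
    }
}
```

Applied to 1(0)*, the sketch prints a copy of S with PIL 1 followed by a loop around a copy with PIL 0, which has the same shape as part (b) of FIGURE 7-4. An alternative with an empty branch, such as (0|), is not handled.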

7.4.1 Regular expressions for transformations

In this section we give the regular expressions for the common parallelizing
transformations. The four distributions employed by most parallelizing compilers are block, cyclic,
block-cyclic, and user mapped.

In the block distribution, contiguous blocks of iterations are divided among processors.
The general form for a 1 dimensional processor array where processor indexes range
from 0 to n is (0*1*2*...n*). A schedule is some number of iterations executed on
processor 0, followed by some iterations executed on processor 1, and so on. The block size
is typically the same for all processors. However, depending on the program, some
processors on either end may not be assigned any iterations and the first processors on
either end that are assigned iterations may only execute a partial block. For example, if
there is a block size of 4, then some of the schedules that are possible for an array of
length 3 are: 000011112222, 11112222, and 00111122. The regular expression is general
enough to express all of these schedules. It also includes some schedules that are not
possible for a block distribution, such as one that skips a processor when assigning
iterations (e.g. 0022). Including the extra schedules in the set does not increase the potential
overhead, as is shown later.

For a cyclic distribution, iterations are assigned in a repeating pattern. The general form
is (0|)(1|)(2|)...(n|)(012...n)*(0|)(1|)(2|)...(n|). The |-clause (i|) means that i may or
may not appear in the schedule. A schedule can start and end anywhere in the repeating
pattern. Some possible schedules for a three processor array are 012012012, 1201212,
and 1201201201. The regular expression we chose to represent the set of schedules
includes some schedules which cannot occur when using a cyclic distribution, such as
02012012 and 01201202.

The block-cyclic distribution is a combination of the two previous ones. The general
form is 0*1*2*...n*(0*1*2*...n*)*0*1*2*...n*. Some possible schedules for a three
processor array are 001122001122, 12200, and 00112^112.


For the mapped distribution, the assignment of iterations is determined by the contents
of a mapping array which is set at run-time. Since nothing is known at compile-time, the
regular expression defining the set of schedules must be very general. It has the form
(0|1|2|...|n)*. Any schedule is possible for this distribution. A dynamically load
balanced distribution is one where the assignment of work to processors can be changed at
run-time based on dynamic conditions. The regular expression describing the set of
schedules for a dynamically load balanced program is the same as the mapped
distribution because the schedule is not known at compile-time.
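
The membership claims in this section can be checked mechanically. The sketch below is an illustration that assumes a POSIX system, since it uses regcomp and regexec from regex.h; the helper name in_language and the three macros are hypothetical. Schedules are written as strings of single-digit processor indexes, the patterns are the three-processor instances of the general forms, the clause (i|) is written i?, and the anchors ^ and $ force the whole schedule to match.

```c
#include <regex.h>
#include <stddef.h>

/* Return 1 if schedule is in the language of pattern, 0 if it is not,
 * and -1 if the pattern fails to compile. */
int in_language(const char *pattern, const char *schedule)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    int found = (regexec(&re, schedule, 0, NULL, 0) == 0);
    regfree(&re);
    return found;
}

/* Three-processor instances of the forms given in the text. */
#define BLOCK_RE  "^0*1*2*$"
#define CYCLIC_RE "^0?1?2?(012)*0?1?2?$"
#define MAPPED_RE "^(0|1|2)*$"
```

For example, in_language(BLOCK_RE, "0022") reports a match even though no block distribution produces that schedule, and in_language(CYCLIC_RE, "02012012") does the same for the cyclic form, while every string over 0, 1, and 2 matches MAPPED_RE.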

7.4.2 Computing the overhead from a regular expression

In this section, we describe how to determine if programs generated by a parallelizing
transformation are expected to have high overhead. We first explain how to compute the
upper bound of the overhead for a single execution of a program. We then make some
assumptions about the expected behavior of programs to estimate the overhead for all
programs under a single parallelizing transformation.

In the ideal case, each processor executes exactly the work that is assigned to it. In our
model, the unit of work is a loop iteration. For every iteration a program executes, it
must also execute the loop control once (updating the loop counter, checking it against
the bounds); we count any additional loop control or any conditionals not contained in
the loop body as overhead.

Given a regular expression defining the set of schedules for a program and a schedule of
one execution of a program, we can bound from above the overhead of the program for
the execution. The procedure is as follows. We have a pointer to a position in the
schedule and a pointer to a position in the regular expression. Both pointers start at the
beginning. At each step, we advance the pointer in the regular expression according to the
current value pointed to in the schedule, and then advance the pointer in the schedule by
1 position. There are three constructs in the regular expression: sequences of symbols,
|-clauses, and closures. We explain each of them below.

For a sequence, the next symbol of the schedule must match the next symbol of the
regular expression. If they do not match, there is an error; this cannot happen if the regular
expression includes the set of possible schedules. If there is a match, we advance the
pointer in the regular expression by 1 position. When there is a |-clause in the regular
expression, we pick one branch and advance the pointer to the beginning of it. If the
next symbol in the schedule is the same as the first symbol in the left side of the |-clause,
we move the pointer to the left side. If the next symbol of the schedule is the same as the
first symbol in the right side, we move the pointer to the right side. If neither matches,
there is an error. If both match, then we follow both paths in parallel and pick the one
that ends without error and has the least overhead. Computing overhead is explained in
the next paragraph. When the branch of a |-clause is finished, we advance the pointer in
the regular expression to the first symbol after the |-clause. When there is a closure (*),
we can either move the pointer in the regular expression to the first symbol after the end
of the closure or move it to the first symbol in the closure. The decision about skipping
to the end or moving to the beginning of the closure is made the same way as the
decisions for the |-clause: we compare the next symbol in the schedule to the two possible
next symbols in the regular expression. After the pointer is advanced beyond the last
symbol in a closure, we move it to the beginning of the closure.


Overhead is computed as we step through the regular expression. Stepping through a
sequence incurs no overhead. For a |-clause, every processor in either clause incurs one
unit of overhead. For a closure, every processor contained anywhere in the closure
incurs one unit of overhead for every repetition of the body of the closure. After
finishing the schedule, we sum up the overhead for each processor and subtract off the
number of iterations that were executed, because a program must execute the loop control
once for every iteration. An overhead number is an indication of how much we can lose
by using a data independent assignment. It is not an exact measure of time because we
don't know how much of the overhead is executed in parallel and we don't know the
relative cost in execution time of loop control versus the body of the loop.

The above method allows us to determine the overhead for one execution of one
program. We really want to determine the expected overhead of a transformation on all
executions of all programs. If we make some assumptions about the behavior of programs
we can estimate the overhead. The two assumptions that we need to make are that loop
iterations execute many times and that the number of processors is not larger than the
number of iterations.

For the block distribution, the set of schedules is (0*1*2*...n*). There can only be
overhead if a processor in the sequence does not execute any iterations. Thus, the
overhead can at most be equal to the number of processors and can only be significant if the
number of processors is much higher than the number of iterations to be executed. This
is not a common case, so the block distribution does not incur significant overhead.
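
The stepping procedure can be sketched for the block form specifically. The code below is one reading of the text and not a definitive implementation: it assumes that a closure which repeats at least once costs nothing beyond the normal loop control, and that a closure which is skipped charges one unit to its processor. The function name block_overhead is hypothetical, and schedules are strings of single-digit processor indexes.

```c
/*
 * Walk a schedule through the expression 0*1*...(nprocs-1)*.
 * Returns the number of closures that are skipped without any repetition,
 * which is the overhead under the assumption stated above, or -1 if the
 * schedule is not in the language of the expression.
 */
int block_overhead(const char *schedule, int nprocs)
{
    int pos = 0;                           /* pointer into the schedule   */
    int overhead = 0;
    for (int p = 0; p < nprocs; p++) {     /* pointer into the expression */
        int reps = 0;
        while (schedule[pos] == '0' + p) { /* stay in the closure p*      */
            pos++;
            reps++;
        }
        if (reps == 0)
            overhead++;                    /* closure skipped: one unit   */
    }
    return schedule[pos] == '\0' ? overhead : -1;
}
```

Under this reading the result can never exceed nprocs, which agrees with the bound stated above: 000011112222 gives 0, while 0022 gives 1 because processor 1 executes no iterations.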

For the cyclic distribution, the set of schedules is
(0|)(1|)(2|)...(n|)(012...n)*(0|)(1|)(2|)...(n|). The overhead occurs in the |-clauses,
which are at the beginning or end of a schedule. If the closure is executed a significant
number of times, then its cost will dominate and the overhead will be negligible.
Because we assume loops will execute many times, the overhead of the cyclic
distribution is expected to be small.

The block-cyclic distribution is also expected to have low overhead because it is a
combination of block and cyclic. Overhead is low when the pattern repeats many times and
the number of processors is not much greater than the number of iterations, which is
believed to be the common case.

The mapped distribution is an exception in that it has high overhead. The set of
schedules is (0|1|2|...|n)*; there is an overhead of 1 for each processor for every iteration
of the loop. Unless the work of a loop body is very large, that is, much larger than the
work it takes for every processor to execute the loop control for one iteration, the
mapped distribution incurs too much overhead and cannot have an efficient data
independent assignment. The same conclusions hold for a dynamically load balanced
program.

7.4.3 Why the computed overhead is an upper bound

Our computation of the overhead for a schedule of a program is pessimistic; using a
regular expression to measure overhead can only give us an upper bound. The answer is not
exact because there can be more than one correct regular expression that we could use
for a set of schedules, and each one can lead to a different estimate of overhead.
However, we believe that choosing a regular expression that leads to an accurate answer is
straightforward; for each of the distributions in the previous section, we used the most
"natural" regular expression, and obtained the right answer. For the distributions with
low overhead, it is only the beginning and end of the schedules that are not known at
compile-time; the cost of the middle dominates and can be completely determined at
compile-time. For the user mapped distribution, nothing is known at compile-time about
its behavior, so it is simple to model accurately as well.

The overhead that we estimate from a regular expression is pessimistic for several
reasons. First, the regular expression that we choose might include schedules that are not
possible executions of a program. The unnecessary schedules might require more
flexibility in the assignment that adds extra overhead. A trivial example of an overly general
regular expression is (0|1|2|...|n)* which includes all possible schedules for
processors 0 through n, hence it could be the specification for any set of schedules. Using this
regular expression as the specification of the set of schedules when we really have a
block distribution would lead to the incorrect conclusion that there is significant
overhead.

We might use a regular expression that includes extra schedules for two reasons. First, a
regular expression has limitations on the languages that it can generate. For example, a
regular expression can have a sequence repeat a fixed number of times or an unlimited
number of times, but it is not possible to specify a regular expression where the number
of repetitions is the same as the number of repetitions in another part of the string. An
example where this applies to schedules is the block distribution, where the size of the
blocks can change from program to program, but is constant across processors for one
particular program. When we specify the set of schedules for a block distribution, we
cannot require that the block sizes be the same across processors.

Another reason we might use an overly general regular expression is that it is too
tedious to exactly specify the set of schedules. In practice, using overly general
schedules is not a problem. For all but the mapped distribution, we have used an overly
general schedule, but we still concluded that the overhead was low.

Even if we find a regular expression that exactly defines the set of schedules for a
transformation, it is possible that there is another regular expression that defines the
same set of schedules, but has a lower overhead. For example, the regular expressions
1*1*1*1*1*1* and 1* define the same language, but the first one has more overhead for
short schedules because we must skip over more closures.
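
The equivalence of the two expressions can be checked by brute force on short strings. This sketch only compares the languages; it does not model the overhead of skipping closures, and the helper `language` and the bound of 8 characters are arbitrary choices made for illustration.

```python
import re
from itertools import product

LONG = re.compile(r"1*1*1*1*1*1*")
SHORT = re.compile(r"1*")

def language(pattern, alphabet, max_len):
    """Collect every string over alphabet, up to max_len, that the pattern fully matches."""
    accepted = set()
    for length in range(max_len + 1):
        for chars in product(alphabet, repeat=length):
            candidate = "".join(chars)
            if pattern.fullmatch(candidate):
                accepted.add(candidate)
    return accepted

# Compare the two languages on all strings of 0s and 1s up to length 8.
print(language(LONG, "01", 8) == language(SHORT, "01", 8))  # True
```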


7.5 Summary

In this chapter, we describe data independent and data dependent assignments of
operations to processors. Thread splitting directly supports data independent assignments.
If a transformation uses a data dependent assignment, the source program must first be
rewritten so that a data independent assignment can be used. The conversion can add
overhead to the program. We explain why this overhead occurs and describe how to
construct a P program that minimizes the overhead. We then describe a method to
measure the overhead for a single execution of a program, and use it to estimate the
overhead for the common distributions. In general, if the assignment is known at
compile-time, then the overhead is low. Specifically, the block, cyclic, and block-cyclic
distributions have little overhead, while it is expected that the user mapped and
dynamically load balanced distributions will have a great deal of overhead.




CHAPTER  8 


Conclusions 


Research in parallelizing compilers, especially for distributed memory machines, still
has far to go before it will have the level of maturity found in compilers for sequential
machines today. Debuggers for these compilers are an even newer topic of research.
This thesis identifies some basic methods, but there is still much work to do in
extending the types of transformations and source languages that can be debugged. We hope
that good debugging tools will help move parallel programming into the mainstream.

This chapter summarizes some of the conclusions about debuggability of transformations
in Section 8.1. In Section 8.2, we outline some conclusions concerning the construction
of debuggers. The major contributions are listed in Section 8.3. In Section 8.4,
some areas for future work are identified.


8.1 Debuggability of distributions

In CHAPTER 5, we showed that the common distribution models (block, cyclic, and
block-cyclic) can be debugged at the source level. The following characteristics make it
possible: the order of execution of operations on a single processor is preserved, and the
values that are computed are the same in the source and target programs. The user
mapped distribution also obeys these properties, but since the assignment of iterations to
processors is not known at compile-time, it cannot be implemented efficiently with our
method.

From a debugging viewpoint, the round-robin style scheduling of the cyclic distribution
has two advantages over the block distribution. The benefits result from the fact that the
order in which work is completed in the parallel program is closer to the order in which
operations are executed in the sequential program. This is advantageous when the user
interrupts the parallel program; there are fewer early operations. It is also better when
trying to run the program so that it will stop at a particular virtual time; more
parallelism can be exploited to reach that time quickly. If consistent breakpoints are used,
then the block distribution is superior. As shown in CHAPTER 6, the round-robin scheduling
of iterations of the cyclic distribution requires much more synchronization than the block
distribution.
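
The first advantage can be illustrated with a toy simulation. The sketch rests on several assumptions that are not taken from the thesis: every processor executes one of its own iterations per step, the program is interrupted after a fixed number of steps, and an iteration counts as early when it has completed but lies after the first unfinished iteration in sequential order. All function names are hypothetical.

```python
def block_owner(i, n_iters, n_procs):
    """Processor that owns iteration i under a block assignment."""
    block = -(-n_iters // n_procs)  # ceiling division
    return i // block

def cyclic_owner(i, n_iters, n_procs):
    """Processor that owns iteration i under a round-robin assignment."""
    return i % n_procs

def completed_after(steps, owner, n_iters, n_procs):
    """Iterations finished when each processor has run `steps` of its own, in order."""
    done = set()
    for p in range(n_procs):
        mine = [i for i in range(n_iters) if owner(i, n_iters, n_procs) == p]
        done.update(mine[:steps])
    return done

def early_count(done, n_iters):
    """Completed iterations that follow the first unfinished one in sequential order."""
    first_unfinished = next((i for i in range(n_iters) if i not in done), n_iters)
    return sum(1 for i in done if i > first_unfinished)

N_ITERS, N_PROCS, STEPS = 16, 4, 2
for name, owner in (("block", block_owner), ("cyclic", cyclic_owner)):
    done = completed_after(STEPS, owner, N_ITERS, N_PROCS)
    print(name, sorted(done), early_count(done, N_ITERS))
# block  [0, 1, 4, 5, 8, 9, 12, 13] 6
# cyclic [0, 1, 2, 3, 4, 5, 6, 7] 0
```

Under these assumptions the cyclic assignment completes a contiguous prefix of the sequential order, while the block assignment leaves six completed iterations beyond the first unfinished one.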


8.2 Building debuggers

In addition to identifying the types of transformations that can be debugged, our work
has also made clear what information the debugger needs from the compiler and how
debuggers can be structured to be independent of a particular parallelizing
transformation.

First, the debugger must know the structural relationship between variables and lines in
the source and target programs. This correspondence cannot be represented by a simple
table look-up because the relationship is dynamic and may require a computation to be
performed. An example of a dynamic case is when there are multiple copies of a loop
counter. An example that requires computation is when an array is distributed and the
debugger must compute the processor index and local address of an element.
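
As a rough illustration of the second example, the following sketch maps a source-level array index to a processor index and a local address. It uses the usual textbook block and cyclic layouts, which are an assumption here and may differ from the layouts generated by the compiler described in the thesis; the function names are hypothetical.

```python
def block_location(index, n_elems, n_procs):
    """Return (processor, local address) of a global index under a block layout."""
    block = -(-n_elems // n_procs)  # ceiling division: elements per processor
    return index // block, index % block

def cyclic_location(index, n_procs):
    """Return (processor, local address) of a global index under a cyclic layout."""
    return index % n_procs, index // n_procs

# Element 9 of a 16-element array distributed over 4 processors.
print(block_location(9, 16, 4))  # (2, 1)
print(cyclic_location(9, 4))     # (1, 2)
```

The same source-level element lands on a different processor and at a different local address depending on the distribution, so the debugger has to perform the computation that matches the distribution in use.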

Second, the debugger must know the source ordering of operations in the parallel
program. From that information, the debugger can determine what order to present events
such as breakpoints. It also uses the order to compute late and early operations, which is
then used to determine if a command should be disallowed. The source ordering can be
specified with a program (the P program). If the P program has a data independent
assignment, the debugger can take advantage of static information. However, if a
dynamic distribution is necessary, then data dependent assignment must be supported
efficiently by the debugger.
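
A minimal sketch of how late and early operations could be derived from the source ordering. It assumes that a late operation precedes the stopping point in source order but has not yet executed, and that an early operation is at or after the stopping point but has already executed. These working definitions, the function `late_and_early`, and the operation names are illustrative assumptions.

```python
def late_and_early(source_order, executed, stop_at):
    """Split operations around a stopping point in source order.

    late:  operations before stop_at that have not executed yet
    early: operations at or after stop_at that have already executed
    """
    position = source_order.index(stop_at)
    before, after = source_order[:position], source_order[position:]
    late = [op for op in before if op not in executed]
    early = [op for op in after if op in executed]
    return late, early

source_order = ["a", "b", "c", "d", "e", "f"]  # sequential (source) order
executed = {"a", "c", "e"}                      # what the processors have run so far

late, early = late_and_early(source_order, executed, stop_at="d")
print(late, early)  # ['b'] ['e']
```

With such sets in hand, a debugger could roll forward to execute the late operations and could disallow a command that is affected by the early ones.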


8.3  Contributions 


Source-level debugging for automatically parallelized programs is a new area of study.
Our work has identified the general problems that a debugger must solve and has
provided solutions for some of the common parallelizing transformations.

First, we have identified the basic services that a source-level debugger must perform.
They are structural mapping, which is necessary for sequential debuggers as well, and
dynamic order restoration, which is unique to MIMD execution. Second, we have
developed a methodology that allows us to separate structural mapping from dynamic order
restoration when constructing a debugger. This allows us to isolate the part of the
debugger that manages parallelism in a module that is independent of the distribution.
Third, we have developed techniques for implementing dynamic order restoration. This
includes mechanisms for virtual time, handling out-of-order events, disallowing
commands, and roll forward.




8.4 Future work

The work presented in this thesis can be extended in many ways. In this section, we
discuss some of the areas that need future work.

8.4.1 Distribution models

The thread splitting model is sufficiently general to handle the common distributions
where the assignment of operations to processors is known at compile-time. Some
systems, particularly those that support dynamic load balancing, use more dynamic
distributions that cannot be determined at compile-time. To efficiently support these types of
distributions, thread splitting must be adapted to allow data dependent assignments.
When the assignment is data independent, computing the virtual time and the set of late
and early operations only requires information about the control flow of the program.
Adding the capability to handle data dependent assignments requires that the compiler
give more information to the debugger about the program behavior.

8.4.2 Source languages with explicit parallelism

Some examples of source languages with explicit parallelism are data parallel languages
(including languages with vector operations) and languages with parallel loops. When
programs with explicit parallelism are executed efficiently on MIMD machines, they
can require dynamic order restoration. The techniques that are used for dynamic order
restoration when the source language is sequential must be extended when the source
language has explicit parallelism.

8.4.3 Roll back mechanisms

In the thesis, we assume that roll forward is inexpensive and that roll back is more
costly. This is because roll back is implemented by re-executing the program from the
beginning. If roll back were inexpensive, some of the basic methods for implementing
dynamic order restoration might be different. As described, the debugger tries to
provide truthful behavior in the presence of early operations. If roll back is inexpensive, the
debugger can use it to eliminate early operations from a program state in the same way
that the debugger can use roll forward to eliminate late operations.

Reversible execution [Pan 88][Tolmach 91][Feldman 88], which has been proposed as
a method for debugging parallel programs, could be used to make roll back less costly.
Interaction between the debugger and reverse execution facility presents some
interesting possibilities. For example, a general reverse execution facility must be able to roll
back to any previous state of the program. However, a source-level debugger only needs
to roll back to a much smaller set of states. This information could permit the reverse
execution facility to reduce the need for storage.

8.4.4 Computation of early operations

In the thesis, we describe conservative methods for computing the set of early
operations. If we use more aggressive methods, we decrease the number of falsely labelled
early operations. This decreases the chances of disallowing a command, which in turn
may reduce the need for re-execution of the program.




We discuss two ways that the debugger can be more exact in the computation of early
operations. First, if a conditional has already been executed, we assume that both
branches of the conditional were executed. If a loop has already been executed, we
assume that an infinite number of iterations were executed. This information is clearly
conservative because a conditional only executes one branch and a loop only executes a
finite set of iterations. If the debugger records dynamic information about the program
while it is executing, for example branching information and loop counts, then it could
more precisely compute the set of operations that were executed, which would in turn
allow the debugger to more precisely compute the set of early operations.
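
The difference between the conservative and the recorded view can be sketched as follows. The representation of a conditional as a dictionary of branch names, the function `executed_ops`, and the statement names are all hypothetical and serve only to illustrate the idea.

```python
def executed_ops(branches, taken=None, loop_count=None):
    """Return (statements assumed executed in a conditional, assumed loop iterations).

    taken=None      -> no branch record: assume every branch was executed
    loop_count=None -> no loop counter: assume unboundedly many iterations
    """
    if taken is None:
        cond_ops = set().union(*branches.values())
    else:
        cond_ops = set(branches[taken])
    iterations = float("inf") if loop_count is None else loop_count
    return cond_ops, iterations

branches = {"then": ["s1", "s2"], "else": ["s3"]}

print(executed_ops(branches))                               # conservative view
print(executed_ops(branches, taken="else", loop_count=5))   # with recorded information
```

The second call yields a strictly smaller set of executed operations, which is what lets fewer operations be falsely labelled as early.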

Second, if the debugger has precise information about which loop iterations are
executed, it can use that information to determine which array elements the loop iterations
have accessed. By default, the debugger must assume that all elements of an array are
accessed if a loop body accesses an array. If the element that a loop iteration accesses is
only dependent on the iteration number, then the debugger can determine the exact set
of elements that have been accessed. Deciding if array subscripts are only dependent on
iteration numbers requires data flow information such as the set of loop invariant
variables and induction variables.
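
A small sketch of this refinement, assuming the subscript is available as a function of the iteration number alone. The function `accessed_elements` and the example subscript 2*i + 1 are hypothetical.

```python
def accessed_elements(array_len, executed_iterations=None, subscript=None):
    """Indices of the array elements that the loop may have accessed.

    subscript maps an iteration number to the index that iteration accesses and
    must depend only on the iteration number. Without it, or without a record of
    the executed iterations, the whole array is assumed to be accessed.
    """
    if executed_iterations is None or subscript is None:
        return set(range(array_len))
    return {subscript(i) for i in executed_iterations}

# Default: every element of an 8-element array is assumed accessed.
print(sorted(accessed_elements(8)))                             # [0, 1, ..., 7]
# With iterations 0..2 recorded and subscript 2*i + 1:
print(sorted(accessed_elements(8, [0, 1, 2], lambda i: 2 * i + 1)))  # [1, 3, 5]
```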

8.4.5 Performance debugging

Another interesting topic is source-level performance debugging. After the user has
debugged the program, the next step is usually tuning. For parallel programs, tuning is
especially important because speed is the motivation for using a parallel machine.
Understanding the performance of a parallel program requires that the user understand
how computation and data are distributed. However, it isn't necessary that the user
know all of the details of the distribution, and a source-level view while tuning the
program can ease some of the programmer's burden. Many of the techniques presented in
this thesis for relating sequential source and parallel target would also be useful for
performance debugging in this context.




Bibliography


[Aho 86] Aho, A., Sethi, R., and Ullman, J. Compilers: Principles, Techniques, and
Tools. Addison-Wesley, Reading, Massachusetts, 1986.

[Aral 88] Aral, Z. and Gertner, I. High-level debugging in Parasight. Workshop on
Parallel and Distributed Debugging, pages 151-162. ACM, Madison, Wisconsin,
May, 1988.

[Bacon 91] Bacon, D.F. and Goldstein, S.C. Hardware-assisted replay of multiprocessor
programs. Proceedings of the ACM/ONR Workshop on Parallel and Distributed
Debugging, pages 194-206. ACM, Santa Cruz, California, May, 1991.

[Bailey 88] Bailey, M.L., Socha, D., and Notkin, D. Debugging Parallel Programs using
Graphical Views. Proceedings of 1988 International Conference on Parallel
Processing, pages 46-49. IEEE, August, 1988.

[Baker 77] Baker, B.S. An algorithm for structuring programs. Journal of the ACM.
24(1):98-120, 1977.

[Balasundaram 89] Balasundaram, V., Kennedy, K., Kremer, U., McKinley, K., Subhlok, J.
The ParaScope editor: an interactive parallel programming tool. Proceedings of
Supercomputing '89, pages 540-550. Reno, NV, November, 1989.

[Bates 83] Bates, P. and Wileden, J. C. An Approach to High-Level Debugging of
Distributed Systems. Johnson, M. S. (editor), Proceedings of the ACM SIGSOFT/SIGPLAN
Software Engineering Symposium on High-Level Debugging, pages 107-111. ACM,
Pacific Grove, California, August, 1983.

[Booth 86] Booth, M. and Misegades, K. Microtasking: a new way to harness
multiprocessors. Cray Channels. 8(2):24-27, Summer, 1986.

[Brooks 92] Brooks, G., Hansen, G. J., and Simmons, S. A new approach to debugging
optimized code. Proceedings of the ACM SIGPLAN '92 Conference on Programming
Language Design and Implementation, pages 1-11. ACM, San Francisco,
California, June, 1992.

[Bruegge 91] Bruegge, B. A portable platform for distributed event environments.
Proceedings of the ACM/ONR Workshop on Parallel and Distributed Debugging, pages
184-193. ACM, Santa Cruz, California, May, 1991.




[Callahan 90] Callahan, D., Kennedy, K., and Subhlok, J. Analysis of event
synchronization in a parallel programming tool. Second ACM SIGPLAN Symposium on
Principles and Practice of Parallel Programming, pages 21-30. ACM, Seattle, WA,
March, 1990.

[at 90] CF77 Compiling System, Volume 1: Fortran Reference Manual. Cray Research,
Inc., 1990.

[Chatterjee 91] Chatterjee, S. Compiling data-parallel programs for efficient execution
on shared-memory multiprocessors. PhD thesis, Carnegie Mellon University,
October, 1991.

[Cohn 91] Cohn, R. Proceedings of the ACM/ONR Workshop on Parallel and Distributed
Debugging, pages 132-143. Santa Cruz, California, May, 1991.

[Copperman 90] Copperman, M. and McDowell, C. E. Detecting Unexpected Data Values in
Optimized Code. Technical Report 90-56, Computer Research Laboratory,
University of California at Santa Cruz, October, 1990.

[Coutant 88] Coutant, D. S., Meloy, S., and Ruscetta, M. DOC: A practical approach to
source-level debugging of globally optimized code. Proceedings of the 1988
Conference on Programming Language Design and Implementation, pages
125-134. Atlanta, Georgia, June, 1988.

[Cytron 90] Cytron, R., Lipkis, J., Schonberg, E. Proceedings of Supercomputing '90,
pages 398-406. New York, NY, November, 1990.

[Dinning 91] Dinning, A. and Schonberg, E. Detecting access anomalies in programs with
critical sections. Proceedings of the ACM/ONR Workshop on Parallel and
Distributed Debugging, pages 85-96. ACM, Santa Cruz, California, May, 1991.

[Emrath 89] Emrath, P.A., Ghosh, S., and Padua, D.A. Proceedings of Supercomputing '89,
pages 580-588. Reno, NV, November, 1989.

[Feiler 82] Feiler, P.H. A Language-Oriented Interactive Programming Environment Based
on Compilation Technology. PhD thesis, Carnegie-Mellon University, May,
1982.

[Feldman 88] Feldman, S. and Brown, C. IGOR: A System for Program Debugging via
Reversible Execution. Workshop on Parallel and Distributed Debugging, pages
112-123. ACM, 1988.

[Forin 88] Forin, A. Debugging of heterogeneous parallel systems. Workshop on Parallel
and Distributed Debugging, pages 130-140. ACM, Madison, Wisconsin, May,
1988.




[FortranD 92] Fox, G., Hiranandani, S., Kennedy, K., Koelbel, C., Kremer, U., Tseng, C.,
Wu, M. FORTRAN D Language Specification. 1992.

[Francioni 91] Francioni, J., Albright, L., and Jackson, J. Debugging parallel programs
with sound. Proceedings of the ACM/ONR Workshop on Parallel and Distributed
Debugging, pages 68-75. ACM, Santa Cruz, California, May, 1991.

[Gupta 88] Gupta, R. Debugging code reorganized by a trace scheduling compiler.
Kartashev, L.P. and Kartashev, S.I. (editors), Supercomputing '88. International
Supercomputing Institute, Inc., 1988.

[Gupta 92] Gupta, M. and Banerjee, P. Demonstration of automatic data partitioning
techniques for parallelizing compilers on multicomputers. IEEE Transactions on
Parallel and Distributed Systems. 3(2):179-193, March, 1992.

[Heath 91] Heath, M.T. and Etheridge, J.A. Visualizing the performance of parallel
programs. IEEE Software. 8(5):29-39, September, 1991.

[Hennessy 82] Hennessy, J. L. Symbolic Debugging of Optimized Code. ACM Transactions on
Programming Languages and Systems. 4(3):323-344, July, 1982.

[Jefferson 85] Jefferson, D. Virtual Time. ACM Transactions on Programming Languages and
Systems. 7(3):404-425, July, 1985.

[Jefferson 87] Jefferson, D. and others. Distributed Simulation and the Time Warp
Operating System. Technical Report TR CSD-870042, Department of Computer Science,
UCLA, 1987.

[Knobe 90] Knobe, K., Lukas, J., and Steele, G. Data Optimization: Allocation of Arrays
to Reduce Communication on SIMD Machines. Journal of Parallel and Distributed
Computing. 8:102-118, February, 1990.

[Leblanc 87] LeBlanc, T. J. and Mellor-Crummey, J. M. Debugging Parallel Programs with
Instant Replay. IEEE Transactions on Computers. C-36(4):471-482, April,
1987.

[Li 91] Li, J. and Chen, M. The data alignment phase in compiling programs for
distributed-memory machines. Journal of Parallel and Distributed Computing.
13(2):213-221, October, 1991.

[Mehrotra 90] Mehrotra, P. and Van Rosendale, J. Programming distributed memory
architectures using Kali. Technical Report ICASE Report No. 90-69, Institute for
Computer Application in Science and Engineering, October, 1990.




[Miller 88] Miller, B.P. and Choi, J.-D. A mechanism for efficient debugging of parallel
programs. SIGPLAN '88 Conference on Programming Language Design and
Implementation, pages 22-24. ACM, Atlanta, GA, June, 1988.

[Netzer 91] Netzer, R.H. and Miller, B.P. Improving the accuracy of data race detection.
Third ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming, pages 21-24. ACM, Williamsburg, VA, July, 1991.

[Padua 86] Padua, D. and Wolfe, M. Advanced Compiler Optimizations for
Supercomputers. Communications of the ACM. 29(29):1184-1200, December, 1986.

[Pan 88] Pan, D.Z. and Linton, M.A. Supporting reverse execution of parallel programs.
Workshop on Parallel and Distributed Debugging, pages 124-129. ACM, 1988.

[Pineo 91] Pineo, P. P. and Soffa, M. L. Debugging parallelized code using liberation
techniques. Proceedings of the ACM/ONR Workshop on Parallel and Distributed
Debugging, pages 108-119. Santa Cruz, California, May, 1991.

[Poly 89] Polychronopoulos, C.D., Girkar, M., Haghighat, M.R., Lee, C.L., Leung, B.,
and Schouten, D. Parafrase-2: an environment for parallelizing, partitioning,
synchronizing, and scheduling programs on multiprocessors. International
Journal of High Speed Computing. 1(1):45-72, May, 1989.

[Ribas 90] Ribas, H. B. Automatic Generation of Systolic Programs from Nested Loops.
PhD thesis, Carnegie Mellon University, June, 1990.

[Steele 84] Steele Jr., G.L. Common Lisp. Digital Press, 1984.

[Stevens 90] Stevens, K.G., Jr. and Sykora, R. The Cray Y-MP: a user's viewpoint.
COMPCON Spring '90, pages 12-15. IEEE, San Francisco, CA, February, 1990.

[Sussman 91] Sussman, Alan. Model-Driven Mapping of Computation onto Distributed
Memory Parallel Computers. PhD thesis, Carnegie Mellon University, September,
1991.

[Tolmach 91] Tolmach, A.P. and Appel, A.W. Debuggable Concurrency Extensions for
Standard ML. Proceedings of the ACM/ONR Workshop on Parallel and Distributed
Debugging, pages 120-131. ACM, Santa Cruz, California, May, 1991.

[Tseng 89] Tseng, P. S. A Parallelizing Compiler for Distributed Memory Parallel
Computers. PhD thesis, Carnegie Mellon University, May, 1989.

[Wang 90] Wang, J.-Z. Debugging parallel programs by trace analysis. Technical Report
UCSC-CRL-90-11, Computer Research Laboratory, University of California,
Santa Cruz, March, 1990.




[Warren 78] Warren, H.S. and Schlaeppi, H.P. Design of the FDS interactive debugging
system. Technical Report RC7214, IBM Research Report, July, 1978.

[Wholey 91] Wholey, S. Automatic data mapping for distributed-memory parallel
computers. PhD thesis, Carnegie Mellon University, School of Computer Science,
1991.

[Zellweger 84] Zellweger, P. T. Interactive Source-Level Debugging of Optimized Programs.
PhD thesis, University of California at Berkeley, May, 1984.

[Zernik 91] Zernik, D. and Rudolph, L. Animating work and time for debugging parallel
programs - foundation and experience. Proceedings of the ACM/ONR Workshop
on Parallel and Distributed Debugging, pages 46-56. ACM, Santa Cruz,
California, May, 1991.




